{"title": "Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model", "book": "Advances in Neural Information Processing Systems", "page_first": 5232, "page_last": 5242, "abstract": "This paper studies a curious phenomenon in learning energy-based model (EBM) using MCMC. In each learning iteration, we generate synthesized examples by running a non-convergent, non-mixing, and non-persistent short-run MCMC toward the current model, always starting from the same initial distribution such as uniform noise distribution, and always running a fixed number of MCMC steps. After generating synthesized examples, we then update the model parameters according to the maximum likelihood learning gradient, as if the synthesized examples are fair samples from the current model. We treat this non-convergent short-run MCMC as a learned generator model or a flow model. We provide arguments for treating the learned non-convergent short-run MCMC as a valid model. We show that the learned short-run MCMC is capable of generating realistic images. More interestingly, unlike traditional EBM or MCMC, the learned short-run MCMC is capable of reconstructing observed images and interpolating between images, like generator or flow models. The code can be found in the Appendix.", "full_text": "Learning Non-Convergent Non-Persistent Short-Run\n\nMCMC Toward Energy-Based Model\n\nErik Nijkamp\n\nMitch Hill\n\nUCLA Department of Statistics\n\nUCLA Department of Statistics\n\nenijkamp@ucla.edu\n\nmkhill@ucla.edu\n\nSong-Chun Zhu\n\nYing Nian Wu\n\nUCLA Department of Statistics\n\nUCLA Department of Statistics\n\nsczhu@stat.ucla.edu\n\nywu@stat.ucla.edu\n\nAbstract\n\nThis paper studies a curious phenomenon in learning energy-based model (EBM)\nusing MCMC. 
In each learning iteration, we generate synthesized examples by\nrunning a non-convergent, non-mixing, and non-persistent short-run MCMC toward\nthe current model, always starting from the same initial distribution such as uniform\nnoise distribution, and always running a \ufb01xed number of MCMC steps. After\ngenerating synthesized examples, we then update the model parameters according\nto the maximum likelihood learning gradient, as if the synthesized examples are fair\nsamples from the current model. We treat this non-convergent short-run MCMC\nas a learned generator model or a \ufb02ow model. We provide arguments for treating\nthe learned non-convergent short-run MCMC as a valid model. We show that\nthe learned short-run MCMC is capable of generating realistic images. More\ninterestingly, unlike traditional EBM or MCMC, the learned short-run MCMC is\ncapable of reconstructing observed images and interpolating between images, like\ngenerator or \ufb02ow models. The code can be found in the Appendix.\n\n1\n\nIntroduction\n\n1.1 Learning Energy-Based Model by MCMC Sampling\n\nThe maximum likelihood learning of the energy-based model (EBM) [32, 55, 22, 44, 33, 37, 8, 35,\n52, 53, 25, 9, 51] follows what Grenander [17] called \u201canalysis by synthesis\u201d scheme. Within each\nlearning iteration, we generate synthesized examples by sampling from the current model, and then\nupdate the model parameters based on the difference between the synthesized examples and the\nobserved examples, so that eventually the synthesized examples match the observed examples in\nterms of some statistical properties de\ufb01ned by the model. To sample from the current EBM, we need\nto use Markov chain Monte Carlo (MCMC), such as the Gibbs sampler [14], Langevin dynamics,\nor Hamiltonian Monte Carlo [36]. 
Recent work that parametrizes the energy function by modern\nconvolutional neural networks (ConvNets) [31, 29] suggests that the \u201canalysis by synthesis\u201d process\ncan indeed generate highly realistic images [52, 13, 24, 12].\n\nAlthough the \u201canalysis by synthesis\u201d learning scheme is intuitively appealing, the convergence of\nMCMC can be impractical, especially if the energy function is multi-modal, which is typically the\ncase if the EBM is to approximate the complex data distribution, such as that of natural images. For\nsuch EBM, the MCMC usually does not mix, i.e., MCMC chains from different starting points tend\nto get trapped in different local modes instead of traversing modes and mixing with each other.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Synthesis by short-run MCMC: Generating synthesized examples by running 100 steps\nof Langevin dynamics initialized from uniform noise for CelebA (64\u00d7 64).\n\nFigure 2: Synthesis by short-run MCMC: Generating synthesized examples by running 100 steps\nof Langevin dynamics initialized from uniform noise for CelebA (128\u00d7 128).\n\n1.2 Short-Run MCMC as Generator or Flow Model\n\nIn this paper, we investigate a learning scheme that is apparently wrong with no hope of learning a\nvalid model. Within each learning iteration, we run a non-convergent, non-mixing and non-persistent\nshort-run MCMC, such as 5 to 100 steps of Langevin dynamics, toward the current EBM. Here,\nwe always initialize the non-persistent short-run MCMC from the same distribution, such as the\nuniform noise distribution, and we always run the same number of MCMC steps. We then update\nthe model parameters as usual, as if the synthesized examples generated by the non-convergent and\nnon-persistent noise-initialized short-run MCMC are the fair samples generated from the current\nEBM. 
We show that, after the convergence of such a learning algorithm, the resulting noise-initialized\nshort-run MCMC can generate realistic images, see Figures 1 and 2.\n\nThe short-run MCMC is not a valid sampler of the EBM because it is short-run. As a result, the\nlearned EBM cannot be a valid model because it is learned based on a wrong sampler. Thus we learn\na wrong sampler of a wrong model. However, the short-run MCMC can indeed generate realistic\nimages. What is going on?\n\nThe goal of this paper is to understand the learned short-run MCMC. We provide arguments that it is\na valid model for the data in terms of matching the statistical properties of the data distribution. We\nalso show that the learned short-run MCMC can be used as a generative model, such as a generator\nmodel [15, 28] or the \ufb02ow model [10, 11, 27, 5, 16], with the Langevin dynamics serving as a\nnoise-injected residual network, with the initial image serving as the latent variables, and with the\ninitial uniform noise distribution serving as the prior distribution of the latent variables. We show\nthat unlike traditional EBM and MCMC, the learned short-run MCMC is capable of reconstructing\nthe observed images and interpolating different images, just like a generator or a \ufb02ow model can do.\nSee Figures 3 and 4. This is very unconventional for EBM or MCMC, and this is due to the fact that\nthe MCMC is non-convergent, non-mixing and non-persistent. 
In fact, our argument applies to the situation where the short-run MCMC does not need to have the EBM as the stationary distribution.

While the learned short-run MCMC can be used for synthesis, the above learning scheme can be generalized to tasks such as image inpainting, super-resolution, style transfer, or inverse optimal control [56, 2], using informative initial distributions and conditional energy functions.

2 Contributions and Related Work

This paper constitutes a conceptual shift, where we shift attention from learning EBM with unrealistic convergent MCMC to the non-convergent short-run MCMC. This is a break from the long tradition of both EBM and MCMC. We provide theoretical and empirical evidence that the learned short-run MCMC is a valid generator or flow model. This conceptual shift frees us from the convergence issue of MCMC, and makes the short-run MCMC a reliable and efficient technology.

2

\fFigure 3: Interpolation by short-run MCMC resembling a generator or flow model: The transition depicts the sequence M_θ(z_ρ) with interpolated noise z_ρ = ρ z_1 + √(1−ρ²) z_2 where ρ ∈ [0, 1] on CelebA (64×64). Left: M_θ(z_1). Right: M_θ(z_2). See Section 3.4.

Figure 4: Reconstruction by short-run MCMC resembling a generator or flow model: The transition depicts M_θ(z_t) over time t from random initialization t = 0 to reconstruction t = 200 on CelebA (64×64). Left: Random initialization. Right: Observed examples. See Section 3.4.

More generally, we shift the focus from energy-based models to energy-based dynamics. This appears to be consistent with the common practice of computational neuroscience [30], where researchers often directly start from the dynamics, such as attractor dynamics [23, 3, 40] whose express goal is to be trapped in a local mode. It is our hope that our work may help to understand the learning of such dynamics.
We leave it to future work.

For short-run MCMC, contrastive divergence (CD) [21] is the most prominent framework for theoretical underpinning. The difference between CD and our study is that in our study, the short-run MCMC is initialized from noise, while CD initializes from observed images. CD has been generalized to persistent CD [48]. Compared to persistent MCMC, the non-persistent MCMC in our method is much more efficient and convenient. [38] performs a thorough investigation of various persistent and non-persistent, as well as convergent and non-convergent, learning schemes. In particular, the emphasis is on learning a proper energy function with persistent and convergent Markov chains. In all of the CD-based frameworks, the goal is to learn the EBM, whereas in our framework, we discard the learned EBM, and only keep the learned short-run MCMC.

Our theoretical understanding of short-run MCMC is based on the generalized moment matching estimator. It is related to moment matching GAN [34]; however, we do not learn a generator adversarially.

3 Non-Convergent Short-Run MCMC as Generator Model

3.1 Maximum Likelihood Learning of EBM

Let x be the signal, such as an image. The energy-based model (EBM) is a Gibbs distribution

p_θ(x) = (1/Z(θ)) exp(f_θ(x)),  (1)

where we assume x is within a bounded range. f_θ(x) is the negative energy and is parametrized by a bottom-up convolutional neural network (ConvNet) with weights θ. Z(θ) = ∫ exp(f_θ(x)) dx is the normalizing constant.

Suppose we observe training examples x_i, i = 1, ..., n ∼ p_data, where p_data is the data distribution. For large n, the sample average over {x_i} approximates the expectation with respect to p_data.
For notational convenience, we treat the sample average and the expectation as the same.

The log-likelihood is

L(θ) = (1/n) ∑_{i=1}^n log p_θ(x_i) ≐ E_{p_data}[log p_θ(x)].  (2)

The derivative of the log-likelihood is

L′(θ) = E_{p_data}[∂/∂θ f_θ(x)] − E_{p_θ}[∂/∂θ f_θ(x)] ≐ (1/n) ∑_{i=1}^n ∂/∂θ f_θ(x_i) − (1/n) ∑_{i=1}^n ∂/∂θ f_θ(x⁻_i),  (3)

where x⁻_i ∼ p_θ(x) for i = 1, ..., n are the generated examples from the current model p_θ(x).

3

\fThe above equation leads to the “analysis by synthesis” learning algorithm. At iteration t, let θ_t be the current model parameters. We generate x⁻_i ∼ p_{θ_t}(x) for i = 1, ..., n. Then we update θ_{t+1} = θ_t + η_t L′(θ_t), where η_t is the learning rate.

3.2 Short-Run MCMC

Generating synthesized examples x⁻_i ∼ p_θ(x) requires MCMC, such as Langevin dynamics (or Hamiltonian Monte Carlo) [36], which iterates

x_{τ+Δτ} = x_τ + (Δτ/2) f′_θ(x_τ) + √Δτ U_τ,  (4)

where τ indexes the time, Δτ is the discretization of time, and U_τ ∼ N(0, I) is the Gaussian noise term. f′_θ(x) = ∂f_θ(x)/∂x can be obtained by back-propagation. If p_θ is of low entropy or low temperature, the gradient term dominates the diffusion noise term, and the Langevin dynamics behaves like gradient descent.

If f_θ(x) is multi-modal, then different chains tend to get trapped in different local modes, and they do not mix. We propose to give up the sampling of p_θ.
Instead, we run a fixed number, e.g., K, steps of MCMC toward p_θ, starting from a fixed initial distribution, p_0, such as the uniform noise distribution. Let M_θ be the K-step MCMC transition kernel. Define

q_θ(x) = (M_θ p_0)(x) = ∫ p_0(z) M_θ(x|z) dz,  (5)

which is the marginal distribution of the sample x after running K-step MCMC from p_0.

In this paper, instead of learning p_θ, we treat q_θ to be the target of learning. After learning, we keep q_θ, but we discard p_θ. That is, the sole purpose of p_θ is to guide a K-step MCMC from p_0.

3.3 Learning Short-Run MCMC

The learning algorithm is as follows. Initialize θ_0. At learning iteration t, let θ_t be the model parameters. We generate x⁻_i ∼ q_{θ_t}(x) for i = 1, ..., m. Then we update θ_{t+1} = θ_t + η_t Δ(θ_t), where

Δ(θ) = E_{p_data}[∂/∂θ f_θ(x)] − E_{q_θ}[∂/∂θ f_θ(x)] ≈ (1/m) ∑_{i=1}^m ∂/∂θ f_θ(x_i) − (1/m) ∑_{i=1}^m ∂/∂θ f_θ(x⁻_i).  (6)

We assume that the algorithm converges so that Δ(θ_t) → 0. At convergence, the resulting θ solves the estimating equation Δ(θ) = 0.

To further improve training, we smooth p_data by convolution with a Gaussian white noise distribution, i.e., injecting additive noises ε_i ∼ N(0, σ²I) to the observed examples, x_i ← x_i + ε_i [46, 43]. This makes it easy for Δ(θ_t) to converge to 0, especially if the number of MCMC steps, K, is small, so that the estimating equation Δ(θ) = 0 may not have a solution without smoothing p_data.

The learning procedure in Algorithm 1 is simple.
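For concreteness, the K-step short-run sampler of equations (4)–(5) can be sketched in a few lines of numpy. This is only an illustrative sketch, not the paper's ConvNet implementation: the quadratic negative energy f(x) = −½‖x − μ‖² and the names `grad_f` and `short_run_langevin` are our own assumptions.

```python
import numpy as np

def grad_f(x, mu=2.0):
    # Gradient of a toy negative energy f(x) = -0.5 * ||x - mu||^2,
    # standing in for the ConvNet gradient obtained by back-propagation.
    return -(x - mu)

def short_run_langevin(m, d, K=100, dt=0.05, seed=0):
    # Equation (5): draw z ~ p0 (uniform noise) and run K Langevin
    # steps (4) toward p_theta; the result is a draw from q_theta.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(m, d))  # z ~ p0
    for _ in range(K):
        noise = rng.standard_normal((m, d))  # U_tau ~ N(0, I)
        x = x + 0.5 * dt * grad_f(x) + np.sqrt(dt) * noise
    return x
```

Because the toy energy has a single mode at μ, a long run drifts the uniform noise toward that mode, while a small K leaves the samples strongly tied to their initialization z; that dependence of x on z is exactly the non-mixing behavior the paper exploits.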
The key to the above algorithm is that the generated {x⁻_i} are independent and fair samples from the model q_θ.

Algorithm 1: Learning short-run MCMC. See code in Appendix 7.3.

input: Negative energy f_θ(x), training steps T, initial weights θ_0, observed examples {x_i}_{i=1}^n, batch size m, variance of noise σ², Langevin discretization Δτ and steps K, learning rate η.
output: Weights θ_{T+1}.
for t = 0 : T do
1. Draw observed images {x_i}_{i=1}^m.
2. Draw initial negative examples {x⁻_i}_{i=1}^m ∼ p_0.
3. Update observed examples x_i ← x_i + ε_i where ε_i ∼ N(0, σ²I).
4. Update negative examples {x⁻_i}_{i=1}^m for K steps of Langevin dynamics (4).
5. Update θ_t by θ_{t+1} = θ_t + g(Δ(θ_t), η, t) where gradient Δ(θ_t) is (6) and g is ADAM [26].

4

\f3.4 Generator or Flow Model for Interpolation and Reconstruction

We may consider q_θ(x) to be a generative model,

z ∼ p_0(z); x = M_θ(z, u),  (7)

where u denotes all the randomness in the short-run MCMC. For the K-step Langevin dynamics, M_θ can be considered a K-layer noise-injected residual network. z can be considered latent variables, and p_0 the prior distribution of z. Due to the non-convergence and non-mixing, x can be highly dependent on z, and z can be inferred from x. This is different from the convergent MCMC, where x is independent of z. When the learning algorithm converges, the learned EBM tends to have low entropy and the Langevin dynamics behaves like gradient descent, where the noise terms are disabled, i.e., u = 0. In that case, we simply write x = M_θ(z).

We can perform interpolation as follows. Generate z_1 and z_2 from p_0(z). Let z_ρ = ρ z_1 + √(1−ρ²) z_2. This interpolation keeps the marginal variance of z_ρ fixed. Let x_ρ = M_θ(z_ρ).
Then x_ρ is the interpolation of x_1 = M_θ(z_1) and x_2 = M_θ(z_2). Figure 3 displays x_ρ for a sequence of ρ ∈ [0, 1].

For an observed image x, we can reconstruct x by running gradient descent on the least squares loss function L(z) = ‖x − M_θ(z)‖², initializing from z_0 ∼ p_0(z), and iterating z_{t+1} = z_t − η_t L′(z_t). Figure 4 displays the sequence of x_t = M_θ(z_t).

In general, z ∼ p_0(z); x = M_θ(z, u) defines an energy-based dynamics. K does not need to be fixed. It can be a stopping time that depends on the past history of the dynamics. The dynamics can be made deterministic by setting u = 0. This includes the attractor dynamics popular in computational neuroscience [23, 3, 40].

4 Understanding the Learned Short-Run MCMC

4.1 Exponential Family and Moment Matching Estimator

An early version of EBM is the FRAME (Filters, Random field, And Maximum Entropy) model [55, 49, 54], which is an exponential family model, where the features are the responses from a bank of filters. The deep FRAME model [35] replaces the linear filters by the pre-trained ConvNet filters. This amounts to only learning the top layer weight parameters of the ConvNet. Specifically, f_θ(x) = ⟨θ, h(x)⟩, where h(x) are the top-layer filter responses of a pre-trained ConvNet, and θ consists of the top-layer weight parameters. For such an f_θ(x), ∂/∂θ f_θ(x) = h(x). Then, the maximum likelihood estimator of p_θ is actually a moment matching estimator, i.e., E_{p_{θ̂_MLE}}[h(x)] = E_{p_data}[h(x)]. If we use the short-run MCMC learning algorithm, it will converge (assume convergence is attainable) to a moment matching estimator, i.e., E_{q_{θ̂_MME}}[h(x)] = E_{p_data}[h(x)].
Thus, the learned model q_{θ̂_MME}(x) is a valid estimator in that it matches the data distribution in terms of the sufficient statistics defined by the EBM.

Figure 5: The blue curve illustrates the model distributions corresponding to different values of parameter θ. The black curve illustrates all the distributions that match p_data (black dot) in terms of E[h(x)]. The MLE p_{θ̂_MLE} (green dot) is the intersection between Θ (blue curve) and Ω (black curve). The MCMC (red dotted line) starts from p_0 (hollow blue dot) and runs toward p_{θ̂_MME} (hollow red dot), but the MCMC stops after K steps, reaching q_{θ̂_MME} (red dot), which is the learned short-run MCMC.

Consider two families of distributions: Ω = {p : E_p[h(x)] = E_{p_data}[h(x)]}, and Θ = {p_θ(x) = exp(⟨θ, h(x)⟩)/Z(θ), ∀θ}. They are illustrated by two curves in Figure 5. Ω contains all the distributions that match the data distribution in terms of E[h(x)]. Both p_{θ̂_MLE} and q_{θ̂_MME} belong to Ω, and of course p_data also belongs to Ω. Θ contains all the EBMs with different values of the parameter θ. The uniform distribution p_0 corresponds to θ = 0, thus p_0 belongs to Θ.

5

\f
One can say that the short-run MCMC is a wrong sampler of a wrong model, but it\nitself is a valid model because it belongs to \u2126.\n\n(x). If we continue to run MCMC for in\ufb01nite steps, we will get to p \u02c6\u03b8MME\n\nmay have a much lower entropy than p \u02c6\u03b8MLE\n\n.\n\nis the projection of pdata onto \u0398. Thus it belongs to \u0398. It also belongs to \u2126 as can\nThe MLE p \u02c6\u03b8MLE\nbe seen from the maximum likelihood estimating equation. Thus it is the intersection of \u2126 and \u0398.\nAmong all the distributions in \u2126, p \u02c6\u03b8MLE\nis the closest to p0. Thus it has the maximum entropy among\nall the distributions in \u2126.\n\nThe above duality between maximum likelihood and maximum entropy follows from the fol-\nlowing fact. Let \u02c6p \u2208 \u0398 \u2229 \u2126 be the intersection between \u0398 and \u2126. \u2126 and \u0398 are orthogonal\nin terms of the Kullback-Leibler divergence. For any p\u03b8 \u2208 \u0398 and for any p \u2208 \u2126, we have the\nPythagorean property [39]: KL(p|p\u03b8 ) = KL(p| \u02c6p) + KL( \u02c6p|p\u03b8 ). See Appendix 7.1 for a proof. Thus\n(1) KL(pdata|p\u03b8 ) \u2265 KL(pdata| \u02c6p), i.e., \u02c6p is MLE within \u0398. (2) KL(p|p0) \u2265 KL( \u02c6p|p0), i.e., \u02c6p has\nmaximum entropy within \u2126.\n\nWe can understand the learned q \u02c6\u03b8MME\n(1) Pythagorean for the right triangle formed by q0, q \u02c6\u03b8MME\n\nfrom two Pythagorean results.\n\n,\n) = KL(q \u02c6\u03b8MME|p0)\u2212 KL(p \u02c6\u03b8MLE|p0) = H(p \u02c6\u03b8MLE\n\n, and p \u02c6\u03b8MLE\n\nKL(q \u02c6\u03b8MME|p \u02c6\u03b8MLE\n\n)\u2212 H(q \u02c6\u03b8MME\n\n),\n\n(8)\n\nto be high in order for it to be a good approximation to p \u02c6\u03b8MLE\n\nwhere H(p) = \u2212Ep[log p(x)] is the entropy of p. See Appendix 7.1. Thus we want the entropy of\n. 
Thus for small K, it is important to let p_0 be the uniform distribution, which has the maximum entropy.

(2) Pythagorean for the right triangle formed by p_{θ̂_MME}, q_{θ̂_MME}, and p_{θ̂_MLE}:

KL(q_{θ̂_MME}|p_{θ̂_MME}) = KL(q_{θ̂_MME}|p_{θ̂_MLE}) + KL(p_{θ̂_MLE}|p_{θ̂_MME}).  (9)

For fixed θ, as K increases, KL(q_θ|p_θ) decreases monotonically [7]. The smaller KL(q_{θ̂_MME}|p_{θ̂_MME}) is, the smaller KL(q_{θ̂_MME}|p_{θ̂_MLE}) and KL(p_{θ̂_MLE}|p_{θ̂_MME}) are. Thus, it is desirable to use large K as long as we can afford the computational cost, to make both q_{θ̂_MME} and p_{θ̂_MME} close to p_{θ̂_MLE}.

4.2 General ConvNet-EBM and Generalized Moment Matching Estimator

For a general ConvNet f_θ(x), the learning algorithm based on short-run MCMC solves the following estimating equation: E_{q_θ}[∂/∂θ f_θ(x)] = E_{p_data}[∂/∂θ f_θ(x)], whose solution is θ̂_MME, which can be considered a generalized moment matching estimator that in general solves the following estimating equation: E_{q_θ}[h(x, θ)] = E_{p_data}[h(x, θ)], where we generalize h(x) in the original moment matching estimator to h(x, θ) that involves both x and θ. For our learning algorithm, h(x, θ) = ∂/∂θ f_θ(x). That is, the learned q_{θ̂_MME} is still a valid estimator in the sense of matching the data distribution. The above estimating equation can be solved by Robbins-Monro's stochastic approximation [42], as long as we can generate independent fair samples from q_θ.

In classical statistics, we often assume that the model is correct, i.e., p_data corresponds to a q_{θ_true} for some true value θ_true.
In that case, the generalized moment matching estimator θ̂_MME follows an asymptotic normal distribution centered at the true value θ_true. The variance of θ̂_MME depends on the choice of h(x, θ). The variance is minimized by the choice h(x, θ) = ∂/∂θ log q_θ(x), which corresponds to the maximum likelihood estimate of q_θ, and which leads to the Cramér-Rao lower bound and Fisher information. See Appendix 7.2 for a brief explanation.

6

\fFigure 6: Generated samples for K = 100 MCMC steps. From left to right: (1) CIFAR-10 (32×32), (2) CelebA (64×64), (3) LSUN Bedroom (64×64).

Table 1: Quality of synthesis and reconstruction for CIFAR-10 (32×32), CelebA (64×64), and LSUN Bedroom (64×64). The number of features nf is 128, 64, and 64, respectively, and K = 100.

(a) IS and FID scores for generated examples.

Model | CIFAR-10 (IS) | CelebA (FID) | LSUN Bedroom (FID)
VAE [28] | 4.28 | 79.09 | 183.18
DCGAN [41] | 6.16 | 32.71 | 54.17
Ours | 6.21 | 23.02 | 44.16

(b) Reconstruction error (MSE per pixel).

Model | CIFAR-10 | CelebA | LSUN Bedroom
VAE [28] | 0.0421 | 0.0341 | 0.0440
DCGAN [41] | 0.0407 | 0.0359 | 0.0592
Ours | 0.0387 | 0.0271 | 0.0272

However, ∂/∂θ log p_θ(x) = ∂/∂θ f_θ(x) − ∂/∂θ log Z(θ) is not equal to ∂/∂θ log q_θ(x). Thus the learning algorithm will not give us the maximum likelihood estimate of q_θ. However, the validity of the learned q_θ does not require h(x, θ) to be ∂/∂θ log q_θ(x). In practice, one can never assume that the model is true.
As a result, the optimality of the maximum likelihood may not hold, and there is no compelling reason that we must use MLE.

The relationship between p_data, q_{θ̂_MME}, p_{θ̂_MME}, and p_{θ̂_MLE} may still be illustrated by Figure 5, although we need to modify the definition of Ω.

5 Experimental Results

In this section, we will demonstrate (1) realistic synthesis, (2) smooth interpolation, (3) faithful reconstruction of observed examples, and (4) the influence of hyperparameters. K denotes the number of MCMC steps in equation (4). nf denotes the number of output feature maps in the first layer of f_θ. See Appendix for additional results.

We emphasize the simplicity of the algorithm and models, see Appendix 7.3 and 7.4, respectively.

5.1 Fidelity

We evaluate the fidelity of generated examples on various datasets, each reduced to 40,000 observed examples. Figure 6 depicts generated samples for various datasets with K = 100 Langevin steps for both training and evaluation. For CIFAR-10 we set the number of features nf = 128, whereas for CelebA and LSUN we use nf = 64. We use 200,000 iterations of model updates, then gradually decrease the learning rate η and injected noise ε_i ∼ N(0, σ²I) for observed examples. Table 1 (a) compares the Inception Score (IS) [45, 4] and Fréchet Inception Distance (FID) [20] with the Inception v3 classifier [47] on 40,000 generated examples. Despite its simplicity, short-run MCMC is competitive.

5.2 Interpolation

We demonstrate interpolation between generated examples. We follow the procedure outlined in Section 3.4. Let x_ρ = M_θ(z_ρ), where M_θ denotes the K-step gradient descent with K = 100. Figure 3 illustrates x_ρ for a sequence of ρ ∈ [0, 1] on CelebA. The interpolation appears smooth and the intermediate samples resemble realistic faces.
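As a side check, the variance-preserving property of the interpolation z_ρ = ρ z_1 + √(1−ρ²) z_2 from Section 3.4 is easy to verify numerically. The sketch below uses standard normal draws for the check (the paper's p_0 is uniform noise; the argument only needs z_1, z_2 independent with equal variance), and the helper name `slerp_like` is our own, not from the paper:

```python
import numpy as np

def slerp_like(z1, z2, rho):
    # z_rho = rho * z1 + sqrt(1 - rho^2) * z2:
    # Var(z_rho) = rho^2 * Var(z1) + (1 - rho^2) * Var(z2),
    # so the marginal variance is preserved when Var(z1) = Var(z2).
    return rho * z1 + np.sqrt(1.0 - rho ** 2) * z2

rng = np.random.default_rng(0)
z1 = rng.standard_normal(100_000)
z2 = rng.standard_normal(100_000)
z_mid = slerp_like(z1, z2, 0.5)  # empirical variance stays near 1
```

By contrast, a plain linear interpolation (1 − ρ) z_1 + ρ z_2 would shrink the variance to 0.5 at the midpoint, pushing z_ρ away from the prior's typical set; the √(1−ρ²) weighting avoids that.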
The interpolation experiment highlights that the short-run MCMC does not mix, which is in fact an advantage instead of a disadvantage. The interpolation ability goes far beyond the capacity of EBM and convergent MCMC.

7

\fTable 2: Influence of the number of MCMC steps K on models with nf = 32 for CIFAR-10 (32×32).

K | 5 | 10 | 25 | 50 | 75 | 100
σ | 0.15 | 0.1 | 0.05 | 0.04 | 0.03 | 0.03
FID | 213.08 | 182.5 | 92.13 | 68.28 | 65.37 | 63.81
IS | 2.06 | 2.27 | 4.06 | 4.82 | 4.88 | 4.92
‖∂f_θ(x)/∂x‖₂ | 7.78 | 3.85 | 1.76 | 0.97 | 0.65 | 0.49

Table 3: Influence of noise and model complexity with K = 100 for CIFAR-10 (32×32).

(a) Influence of additive noise ε_i ∼ N(0, σ²I).

σ | 0.10 | 0.08 | 0.06 | 0.05 | 0.04 | 0.03
FID | 132.51 | 117.36 | 94.72 | 83.15 | 65.71 | 63.81
IS | 4.05 | 4.20 | 4.63 | 4.78 | 4.83 | 4.92

(b) Influence of model complexity nf.

nf | 32 | 64 | 128
FID | 63.81 | 46.61 | 44.50
IS | 4.92 | 5.49 | 6.21

5.3 Reconstruction

We demonstrate reconstruction of observed examples. For short-run MCMC, we follow the procedure outlined in Section 3.4. For an observed image x, we reconstruct x by running gradient descent on the least squares loss function L(z) = ‖x − M_θ(z)‖², initializing from z_0 ∼ p_0(z), and iterating z_{t+1} = z_t − η_t L′(z_t). For VAE, reconstruction is readily available. For GAN, we perform Langevin inference of latent variables [19, 50]. Figure 4 depicts faithful reconstruction. Table 1 (b) illustrates competitive reconstructions in terms of MSE (per pixel) for 1,000 observed leave-out examples. Again, the reconstruction ability of the short-run MCMC is due to the fact that it is not mixing.

5.4 Influence of Hyperparameters

MCMC Steps.
Table 2 depicts the influence of varying the number of MCMC steps K while training on synthesis quality and on the average gradient magnitude ‖∂f_θ(x)/∂x‖₂ over the K-step Langevin dynamics (4). We observe: (1) the quality of synthesis decreases with decreasing K, and (2) the shorter the MCMC, the colder the learned EBM, and the more dominant the gradient descent part of the Langevin. With small K, short-run MCMC fails “gracefully” in terms of synthesis. A choice of K = 100 appears reasonable.

Injected Noise. To stabilize training, we smooth p_data by injecting additive noises ε_i ∼ N(0, σ²I) to observed examples x_i ← x_i + ε_i. Table 3 (a) depicts the influence of σ² on the fidelity of negative examples in terms of IS and FID. That is, when lowering σ², the fidelity of the examples improves. Hence, it is desirable to pick the smallest σ² that maintains the stability of training. Further, to improve synthesis, we may gradually decrease the learning rate η and anneal σ² while training.

Model Complexity. We investigate the influence of the number of output feature maps nf on generated samples with K = 100. Table 3 (b) summarizes the quality of synthesis in terms of IS and FID. As the number of features nf increases, so does the quality of the synthesis. Hence, the quality of synthesis may scale with nf until the computational means are exhausted.

6 Conclusion

Despite our focus on short-run MCMC, we do not advocate abandoning EBM altogether. On the contrary, we ultimately aim to learn valid EBM [38]. Hopefully, the non-convergent short-run MCMC studied in this paper may be useful in this endeavor. It is also our hope that our work may help to understand the learning of attractor dynamics popular in neuroscience.

Acknowledgments

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and XSEDE grant ASC170063. We thank Prof. Stu Geman, Prof.
Xianfeng (David) Gu, Diederik P. Kingma, Guodong Zhang, and Will Grathwohl for\nhelpful discussions.\n\n8\n\n\fReferences\n\n[1] 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,\n\nConference Track Proceedings. OpenReview.net, 2017.\n\n[2] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Carla E.\nBrodley, editor, Machine Learning, Proceedings of the Twenty-\ufb01rst International Conference (ICML 2004),\nBanff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International Conference Proceeding Series.\nACM, 2004.\n\n[3] Daniel J. Amit. Modeling brain function: the world of attractor neural networks, 1st Edition. Cambridge\n\nUniv. Press, 1989.\n\n[4] Shane T. Barratt and Rishi Sharma. A note on the inception score. CoRR, abs/1801.01973, 2018.\n\n[5] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and J\u00f6rn-Henrik Jacobsen. Invertible\nresidual networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th\nInternational Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California,\nUSA, volume 97 of Proceedings of Machine Learning Research, pages 573\u2013582. PMLR, 2019.\n\n[6] Yoshua Bengio and Yann LeCun, editors. 3rd International Conference on Learning Representations,\n\nICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.\n\n[7] Thomas M. Cover and Joy A. Thomas. Elements of information theory (2. ed.). Wiley, 2006.\n\n[8] Jifeng Dai and Ying Nian Wu. Generative modeling of convolutional neural networks. In Bengio and\n\nLeCun [6].\n\n[9] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard H. Hovy, and Aaron C. Courville. Calibrating\nenergy-based generative adversarial networks. 
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings [1].

[10] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: non-linear independent components estimation. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015.

[11] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings [1].

[12] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. CoRR, abs/1903.08689, 2019.

[13] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 9155–9164. IEEE Computer Society, 2018.

[14] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, 1984.

[15] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.

[16] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: free-form continuous dynamics for scalable reversible generative models.
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

[17] Ulf Grenander and Michael I. Miller. Pattern theory: from representation to inference. Oxford University Press, 2007.

[18] Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017.

[19] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In Satinder P. Singh and Shaul Markovitch, editors, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 1976–1984. AAAI Press, 2017.

[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Guyon et al. [18], pages 6626–6637.

[21] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[22] Geoffrey E. Hinton, Simon Osindero, Max Welling, and Yee Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006.

[23] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.

[24] Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective learning for discriminative classification. In Advances in Neural Information Processing Systems, 2017.

[25] Taesup Kim and Yoshua Bengio.
Deep directed generative models with energy-based probability estimation.\n\nCoRR, abs/1606.03439, 2016.\n\n[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Bengio and LeCun\n\n[6].\n\n[27] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative \ufb02ow with invertible 1x1 convolutions. In\nSamy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol\u00f2 Cesa-Bianchi, and Roman\nGarnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural\nInformation Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr\u00e9al, Canada., pages\n10236\u201310245, 2018.\n\n[28] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann\nLeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada,\nApril 14-16, 2014, Conference Track Proceedings, 2014.\n\n[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classi\ufb01cation with deep convolutional\nneural networks. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, L\u00e9on Bottou,\nand Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual\nConference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6,\n2012, Lake Tahoe, Nevada, United States, pages 1106\u20131114, 2012.\n\n[30] Dmitry Krotov and John J. Hop\ufb01eld. Unsupervised learning by competing hidden units. Proc. Natl. Acad.\n\nSci. U.S.A., 116(16):7723\u20137731, 2019.\n\n[31] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to\n\ndocument recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[32] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning.\n\nPredicting structured data, 1(0), 2006.\n\n[33] Honglak Lee, Roger B. Grosse, Rajesh Ranganath, and Andrew Y. Ng. 
Convolutional deep belief networks\nfor scalable unsupervised learning of hierarchical representations. In Andrea Pohoreckyj Danyluk, L\u00e9on\nBottou, and Michael L. Littman, editors, Proceedings of the 26th Annual International Conference on\nMachine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM\nInternational Conference Proceeding Series, pages 609\u2013616. ACM, 2009.\n\n[34] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnab\u00e1s P\u00f3czos. MMD GAN: towards\n\ndeeper understanding of moment matching network. In Guyon et al. [18], pages 2203\u20132213.\n\n[35] Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Learning FRAME models using CNN \ufb01lters. In Dale\nSchuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on Arti\ufb01cial\nIntelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 1902\u20131910. AAAI Press, 2016.\n\n[36] Radford M Neal. MCMC using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.\n\n[37] Jiquan Ngiam, Zhenghao Chen, Pang Wei Koh, and Andrew Y. Ng. Learning deep energy models. In\nLise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine\nLearning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1105\u20131112. Omnipress,\n2011.\n\n[38] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-\nbased maximum likelihood learning of energy-based models. Thirty-Fourth AAAI Conference on Arti\ufb01cial\nIntelligence, 2020.\n\n[39] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random \ufb01elds.\n\nIEEE Trans. Pattern Anal. Mach. Intell., 19(4):380\u2013393, 1997.\n\n[40] Bruno Poucet and Etienne Save. Attractors in memory. Science, 308(5723):799\u2013800, 2005.\n\n[41] Alec Radford, Luke Metz, and Soumith Chintala. 
Unsupervised representation learning with deep convolutional generative adversarial networks. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

[42] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[43] Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Guyon et al. [18], pages 2018–2028.

[44] Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In David A. Van Dyk and Max Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, volume 5 of JMLR Proceedings, pages 448–455. JMLR.org, 2009.

[45] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016.

[46] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings [1].

[47] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision.
In 2016 IEEE Conference on Computer Vision and Pattern\nRecognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818\u20132826. IEEE Computer\nSociety, 2016.\n\n[48] Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient.\nIn William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Machine Learning, Proceedings of\nthe Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of\nACM International Conference Proceeding Series, pages 1064\u20131071. ACM, 2008.\n\n[49] Ying Nian Wu, Song Chun Zhu, and Xiuwen Liu. Equivalence of julesz ensembles and FRAME models.\n\nInternational Journal of Computer Vision, 38(3):247\u2013265, 2000.\n\n[50] Jianwen Xie, Yang Lu, Ruiqi Gao, and Ying Nian Wu. Cooperative learning of energy-based model and\nlatent variable model via MCMC teaching. In Sheila A. McIlraith and Kilian Q. Weinberger, editors,\nProceedings of the Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence, (AAAI-18), the 30th innovative\nApplications of Arti\ufb01cial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in\nArti\ufb01cial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 4292\u20134301.\nAAAI Press, 2018.\n\n[51] Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor\n\nand generator networks. IEEE transactions on pattern analysis and machine intelligence (PAMI), 2018.\n\n[52] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A theory of generative convnet. In Maria-\nFlorina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on\nMachine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop\nand Conference Proceedings, pages 2635\u20132644. JMLR.org, 2016.\n\n[53] Junbo Jake Zhao, Micha\u00ebl Mathieu, and Yann LeCun. Energy-based generative adversarial networks. 
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings [1].

[54] Song Chun Zhu and David Mumford. GRADE: Gibbs reaction and diffusion equations. In Sixth International Conference on Computer Vision, pages 847–854, 1998.

[55] Song Chun Zhu, Ying Nian Wu, and David Mumford. Filters, random fields and maximum entropy (FRAME): towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.

[56] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Dieter Fox and Carla P. Gomes, editors, Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, pages 1433–1438. AAAI Press, 2008.
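Appendix: illustrative sketch. The non-convergent, non-persistent short-run sampler studied above (a fixed uniform-noise initialization and exactly K Langevin steps per learning iteration, as in update (4)) can be sketched as follows. This is a minimal illustration, not the released code: the quadratic toy energy `grad_f` stands in for the learned network f_θ, and all names and the step size are illustrative.

```python
import numpy as np

def grad_f(x):
    # Gradient of a toy energy f(x) = -||x||^2 / 2, i.e. a standard
    # Gaussian EBM p(x) ∝ exp(f(x)); in the paper, f_θ is a ConvNet.
    return -x

def short_run_langevin(grad_f, shape, K=100, step=0.1, seed=0):
    """Non-convergent, non-persistent K-step Langevin sampler.

    Every call restarts from the same fixed initial distribution
    (uniform noise) and runs exactly K steps of
        x <- x + (step^2 / 2) * grad_f(x) + step * N(0, I).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=shape)  # fixed initial distribution
    for _ in range(K):
        x = x + 0.5 * step**2 * grad_f(x) + step * rng.standard_normal(shape)
    return x

samples = short_run_langevin(grad_f, shape=(64, 2), K=100, step=0.1)
print(samples.shape)  # (64, 2)
```

In the learning loop, such K-step samples serve as the synthesized (negative) examples in the maximum likelihood gradient, as if they were fair samples from the current model; because the chain is short and restarted each iteration, the sampler itself acts as the learned generator.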