{"title": "Improved Variational Inference with Inverse Autoregressive Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 4743, "page_last": 4751, "abstract": "The framework of normalizing flows provides a general strategy for flexible variational inference of posteriors over latent variables. We propose a new type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier published flows, scales well to high-dimensional latent spaces. The proposed flow consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network. In experiments, we show that IAF significantly improves upon diagonal Gaussian approximate posteriors. In addition, we demonstrate that a novel type of variational autoencoder, coupled with IAF, is competitive with neural autoregressive models in terms of attained log-likelihood on natural images, while allowing significantly faster synthesis.", "full_text": "Improved Variational Inference\nwith Inverse Autoregressive Flow\n\nDiederik P. Kingma\n\ndpkingma@openai.com\n\nTim Salimans\n\ntim@openai.com\n\nRafal Jozefowicz\nrafal@openai.com\n\nXi Chen\n\npeter@openai.com\n\nIlya Sutskever\n\nilya@openai.com\n\nMax Welling\u21e4\n\nM.Welling@uva.nl\n\nAbstract\n\nThe framework of normalizing \ufb02ows provides a general strategy for \ufb02exible vari-\national inference of posteriors over latent variables. We propose a new type of\nnormalizing \ufb02ow, inverse autoregressive \ufb02ow (IAF), that, in contrast to earlier\npublished \ufb02ows, scales well to high-dimensional latent spaces. The proposed \ufb02ow\nconsists of a chain of invertible transformations, where each transformation is\nbased on an autoregressive neural network. In experiments, we show that IAF\nsigni\ufb01cantly improves upon diagonal Gaussian approximate posteriors. 
In addition,\nwe demonstrate that a novel type of variational autoencoder, coupled with IAF, is\ncompetitive with neural autoregressive models in terms of attained log-likelihood\non natural images, while allowing signi\ufb01cantly faster synthesis.\n\n1\n\nIntroduction\n\nStochastic variational inference (Blei et al., 2012; Hoffman et al., 2013) is a method for scalable\nposterior inference with large datasets using stochastic gradient ascent. It can be made especially\nef\ufb01cient for continuous latent variables through latent-variable reparameterization and inference\nnetworks, amortizing the cost, resulting in a highly scalable learning procedure (Kingma and Welling,\n2013; Rezende et al., 2014; Salimans et al., 2014). When using neural networks for both the\ninference network and generative model, this results in class of models called variational auto-\nencoders (Kingma and Welling, 2013) (VAEs). A general strategy for building \ufb02exible inference\nnetworks, is the framework of normalizing \ufb02ows (Rezende and Mohamed, 2015). In this paper we\npropose a new type of \ufb02ow, inverse autoregressive \ufb02ow (IAF), which scales well to high-dimensional\nlatent space.\nAt the core of our proposed method lie Gaussian autoregressive functions that are normally used\nfor density estimation: functions that take as input a variable with some speci\ufb01ed ordering such\nas multidimensional tensors, and output a mean and standard deviation for each element of the\ninput variable conditioned on the previous elements. Examples of such functions are autoregressive\nneural density estimators such as RNNs, MADE (Germain et al., 2015), PixelCNN (van den Oord\net al., 2016b) or WaveNet (van den Oord et al., 2016a) models. We show that such functions\ncan often be turned into invertible nonlinear transformations of the input, with a simple Jacobian\ndeterminant. 
Since the transformation is \ufb02exible and the determinant known, it can be used as a\nnormalizing \ufb02ow, transforming a tensor with relatively simple known density, into a new tensor with\nmore complicated density that is still cheaply computable. In contrast with most previous work on\n\n\u21e4University of Amsterdam, University of California Irvine, and the Canadian Institute for Advanced Research\n\n(CIFAR).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f(a) Prior distribution\n\n(b) Posteriors in standard VAE (c) Posteriors in VAE with IAF\n\nFigure 1: Best viewed in color. We \ufb01tted a variational auto-encoder (VAE) with a spherical Gaussian\nprior, and with factorized Gaussian posteriors (b) or inverse autoregressive \ufb02ow (IAF) posteriors (c)\nto a toy dataset with four datapoints. Each colored cluster corresponds to the posterior distribution of\none datapoint. IAF greatly improves the \ufb02exibility of the posterior distributions, and allows for a\nmuch better \ufb01t between the posteriors and the prior.\n\nimproving inference models including previously used normalizing \ufb02ows, this transformation is well\nsuited to high-dimensional tensor variables, such as spatio-temporally organized variables.\nWe demonstrate this method by improving inference networks of deep variational auto-encoders.\nIn particular, we train deep variational auto-encoders with latent variables at multiple levels of the\nhierarchy, where each stochastic variable is a three-dimensional tensor (a stack of featuremaps), and\ndemonstrate improved performance.\n\n2 Variational Inference and Learning\n\nLet x be a (set of) observed variable(s), z a (set of) latent variable(s) and let p(x, z) be the parametric\nmodel of their joint distribution, called the generative model de\ufb01ned over the variables. 
Given a dataset X = {x_1, ..., x_N} we typically wish to perform maximum marginal likelihood learning of its parameters, i.e. to maximize

log p(X) = \sum_{i=1}^N log p(x^{(i)}),    (1)

but in general this marginal likelihood is intractable to compute or differentiate directly for flexible generative models, e.g. when components of the generative model are parameterized by neural networks. A solution is to introduce q(z|x), a parametric inference model defined over the latent variables, and optimize the variational lower bound on the marginal log-likelihood of each observation x:

log p(x) \geq E_{q(z|x)}[log p(x, z) - log q(z|x)] = L(x; \theta)    (2)

where \theta indicates the parameters of the p and q models. Keeping in mind that Kullback-Leibler divergences D_{KL}(.) are non-negative, it is clear that L(x; \theta) is a lower bound on log p(x), since it can be written as follows:

L(x; \theta) = log p(x) - D_{KL}(q(z|x)||p(z|x))    (3)

There are various ways to optimize the lower bound L(x; \theta); for continuous z it can be done efficiently through a re-parameterization of q(z|x), see e.g. (Kingma and Welling, 2013; Rezende et al., 2014). As can be seen from equation (3), maximizing L(x; \theta) w.r.t. \theta will concurrently maximize log p(x) and minimize D_{KL}(q(z|x)||p(z|x)). The closer D_{KL}(q(z|x)||p(z|x)) is to 0, the closer L(x; \theta) will be to log p(x), and the better an approximation our optimization objective L(x; \theta) is to our true objective log p(x). Also, minimization of D_{KL}(q(z|x)||p(z|x)) can be a goal in itself, if we're interested in using q(z|x) for inference after optimization.
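As a concrete illustration of eq. (2), the single-sample reparameterized estimator of L(x; \theta) takes only a few lines of NumPy. The toy generative model below (p(z) = N(0, I), p(x|z) = N(z, I)) and all function names are illustrative assumptions, not the models used in this paper:

```python
import numpy as np

def log_normal(x, mu, sigma):
    # Log-density of a diagonal Gaussian N(mu, sigma^2), summed over dimensions.
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - 0.5 * ((x - mu) / sigma) ** 2)

def elbo_estimate(x, mu_q, sigma_q, rng):
    # Single-sample Monte Carlo estimate of L(x) = E_q[log p(x, z) - log q(z|x)],
    # using the reparameterization z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu_q.shape)
    z = mu_q + sigma_q * eps
    log_qz = log_normal(z, mu_q, sigma_q)
    # Toy generative model (an assumption): p(z) = N(0, I), p(x|z) = N(z, I).
    log_pz = log_normal(z, np.zeros_like(z), np.ones_like(z))
    log_px_z = log_normal(x, z, np.ones_like(z))
    return log_px_z + log_pz - log_qz
```

For this toy model the exact posterior is N(x/2, I/2); plugging it in as q makes the estimator equal log p(x) for every sample, illustrating that the bound is tight exactly when D_{KL}(q||p) = 0.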
In any case, the divergence D_{KL}(q(z|x)||p(z|x)) is a function of our parameters through both the inference model and the generative model, and increasing the flexibility of either is generally helpful towards our objective.

Note that in models with multiple latent variables, the inference model is typically factorized into partial inference models with some ordering; e.g. q(z_a, z_b|x) = q(z_a|x) q(z_b|z_a, x). We'll write q(z|x, c) to denote such partial inference models, conditioned on both the data x and a further context c, which includes the previous latent variables according to the ordering.

2.1 Requirements for Computational Tractability

Requirements for the inference model, in order to be able to efficiently optimize the bound, are that it is (1) computationally efficient to compute and differentiate its probability density q(z|x), and (2) computationally efficient to sample from, since both these operations need to be performed for each datapoint in a minibatch at every iteration of optimization. If z is high-dimensional and we want to make efficient use of parallel computational resources like GPUs, then parallelizability of these operations across dimensions of z is a large factor towards efficiency. These requirements restrict the class of approximate posteriors q(z|x) that are practical to use. In practice this often leads to the use of diagonal posteriors, e.g. q(z|x) \sim N(\mu(x), \sigma^2(x)), where \mu(x) and \sigma(x) are often nonlinear functions parameterized by neural networks. However, as explained above, we also need the density q(z|x) to be sufficiently flexible to match the true posterior p(z|x).

2.2 Normalizing Flow

Normalizing Flow (NF), introduced by (Rezende and Mohamed, 2015) in the context of stochastic gradient variational inference, is a powerful framework for building flexible posterior distributions through an iterative procedure.
The general idea is to start off with an initial random variable with a relatively simple distribution with known (and computationally cheap) probability density function, and then apply a chain of invertible parameterized transformations f_t, such that the last iterate z_T has a more flexible distribution^2:

z_0 \sim q(z_0|x),    z_t = f_t(z_{t-1}, x)    \forall t = 1...T    (4)

As long as the Jacobian determinant of each of the transformations f_t can be computed, we can still compute the probability density function of the last iterate:

log q(z_T|x) = log q(z_0|x) - \sum_{t=1}^T log |det(dz_t / dz_{t-1})|    (5)

However, (Rezende and Mohamed, 2015) experiment with only a very limited family of such invertible transformations with known Jacobian determinant, namely:

f_t(z_{t-1}) = z_{t-1} + u h(w^T z_{t-1} + b)    (6)

where u and w are vectors, w^T is w transposed, b is a scalar and h(.) is a nonlinearity, such that u h(w^T z_{t-1} + b) can be interpreted as an MLP with a bottleneck hidden layer with a single unit. Since information goes through the single bottleneck, a long chain of transformations is required to capture high-dimensional dependencies.

3 Inverse Autoregressive Transformations

In order to find a type of normalizing flow that scales well to high-dimensional space, we consider Gaussian versions of autoregressive autoencoders such as MADE (Germain et al., 2015) and the PixelCNN (van den Oord et al., 2016b). Let y be a variable modeled by such a model, with some chosen ordering on its elements y = {y_i}_{i=1}^D. We will use [\mu(y), \sigma(y)] to denote the function mapping the vector y to the vectors \mu and \sigma. Due to the autoregressive structure, the Jacobian is lower triangular with zeros on the diagonal: \partial[\mu_i, \sigma_i]/\partial y_j = [0, 0] for j \geq i.
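As an aside, the planar transformation of eq. (6) and the density recursion of eq. (5) are simple enough to check numerically: the Jacobian of f(z) = z + u h(w^T z + b) is I + u \psi(z)^T with \psi(z) = h'(w^T z + b) w, so by the matrix determinant lemma its determinant is the scalar 1 + u^T \psi(z). A minimal NumPy sketch with h = tanh (all concrete names and numbers here are illustrative assumptions):

```python
import numpy as np

def planar_flow(z, u, w, b):
    # One planar transformation f(z) = z + u * h(w^T z + b) with h = tanh (eq. 6).
    # Returns f(z) and log|det df/dz|, the per-step term of eq. (5).
    a = w @ z + b
    f = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w          # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))    # matrix determinant lemma
    return f, log_det

def flow_log_density(z0, log_q0, params):
    # Chain of planar steps: log q(z_T) = log q(z_0) - sum_t log|det df_t/dz|.
    z, log_q = z0, log_q0
    for (u, w, b) in params:
        z, log_det = planar_flow(z, u, w, b)
        log_q -= log_det
    return z, log_q
```

Note that each step contributes a single scalar to the log-density, reflecting the one-unit bottleneck discussed above.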
The elements [\mu_i(y_{1:i-1}), \sigma_i(y_{1:i-1})] are the predicted mean and standard deviation of the i-th element of y, which are functions of only the previous elements in y.

Sampling from such a model is a sequential transformation from a noise vector \epsilon \sim N(0, I) to the corresponding vector y: y_0 = \mu_0 + \sigma_0 \cdot \epsilon_0, and for i > 0, y_i = \mu_i(y_{1:i-1}) + \sigma_i(y_{1:i-1}) \cdot \epsilon_i.

^2 where x is the context, such as the value of the datapoint. In case of models with multiple levels of latent variables, the context also includes the value of the previously sampled latent variables.

Algorithm 1: Pseudo-code of an approximate posterior with Inverse Autoregressive Flow (IAF)

Data:
    x: a datapoint, and optionally other conditioning information
    \theta: neural network parameters
    EncoderNN(x; \theta): encoder neural network, with additional output h
    AutoregressiveNN[*](z, h; \theta): autoregressive neural networks, with additional input h
    sum(.): sum over vector elements
    sigmoid(.): element-wise sigmoid function
Result:
    z: a random sample from q(z|x), the approximate posterior distribution
    l: the scalar value of log q(z|x), evaluated at sample 'z'

[\mu, \sigma, h] <- EncoderNN(x; \theta)
\epsilon \sim N(0, I)
z <- \sigma \odot \epsilon + \mu
l <- -sum(log \sigma + (1/2) \epsilon^2 + (1/2) log(2\pi))
for t <- 1 to T do
    [m, s] <- AutoregressiveNN[t](z, h; \theta)
    \sigma <- sigmoid(s)
    z <- \sigma \odot z + (1 - \sigma) \odot m
    l <- l - sum(log \sigma)
end

The computation involved in this transformation is clearly proportional to the dimensionality D. Since variational inference requires sampling from the posterior, such models are not interesting for direct use in such applications. However, the inverse transformation is interesting for normalizing flows, as we will show.
As long as we have \sigma_i > 0 for all i, the sampling transformation above is a one-to-one transformation, and can be inverted: \epsilon_i = (y_i - \mu_i(y_{1:i-1})) / \sigma_i(y_{1:i-1}).

We make two key observations, important for normalizing flows. The first is that this inverse transformation can be parallelized, since (in the case of autoregressive autoencoders) the computations of the individual elements \epsilon_i do not depend on each other. The vectorized transformation is:

\epsilon = (y - \mu(y)) / \sigma(y)    (7)

where the subtraction and division are elementwise.

The second key observation is that this inverse autoregressive operation has a simple Jacobian determinant. Note that due to the autoregressive structure, \partial[\mu_i, \sigma_i]/\partial y_j = [0, 0] for j \geq i. As a result, the transformation has a lower triangular Jacobian (\partial\epsilon_i/\partial y_j = 0 for j > i), with a simple diagonal: \partial\epsilon_i/\partial y_i = 1/\sigma_i. The determinant of a lower triangular matrix equals the product of the diagonal terms. As a result, the log-determinant of the Jacobian of the transformation is remarkably simple and straightforward to compute:

log |det(d\epsilon/dy)| = -\sum_{i=1}^D log \sigma_i(y)    (8)

The combination of model flexibility, parallelizability across dimensions, and a simple log-determinant makes this transformation interesting for use as a normalizing flow over high-dimensional latent space.

4 Inverse Autoregressive Flow (IAF)

We propose a new type of normalizing flow (eq. (5)), based on transformations that are equivalent to the inverse autoregressive transformation of eq. (7) up to reparameterization. See algorithm 1 for pseudo-code of an approximate posterior with the proposed flow. We let an initial encoder neural network output \mu_0 and \sigma_0, in addition to an extra output h, which serves as an additional input to each subsequent step in the flow.
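The contrast between the sequential forward transformation and the parallel inverse of eq. (7) with log-determinant as in eq. (8) can be made concrete with a toy autoregressive model. The particular choices of \mu_i and \sigma_i below are illustrative assumptions, standing in for a MADE- or PixelCNN-style network:

```python
import numpy as np

def ar_params(y):
    # Toy autoregressive functions (illustrative assumptions): mu_i and sigma_i
    # depend only on y_{1:i-1}, mimicking a MADE/PixelCNN-style model.
    mu = np.zeros_like(y)
    sigma = np.ones_like(y)
    mu[1:] = 0.5 * np.cumsum(y)[:-1]      # mu_i = 0.5 * sum_{j<i} y_j
    sigma[1:] = np.exp(0.1 * y[:-1])      # sigma_i = exp(0.1 * y_{i-1}) > 0
    return mu, sigma

def sample_sequential(eps):
    # Forward sampling: y_i = mu_i(y_{1:i-1}) + sigma_i(y_{1:i-1}) * eps_i,
    # inherently sequential (cost proportional to D).
    y = np.zeros_like(eps)
    for i in range(len(eps)):
        mu, sigma = ar_params(y)          # entries >= i of y are still unset (zero)
        y[i] = mu[i] + sigma[i] * eps[i]
    return y

def inverse_parallel(y):
    # Inverse transformation of eq. (7): a single vectorized pass, no loop,
    # with the log-determinant of eq. (8).
    mu, sigma = ar_params(y)
    eps = (y - mu) / sigma
    log_det = -np.sum(np.log(sigma))
    return eps, log_det
```

Round-tripping a noise vector through `sample_sequential` and back through `inverse_parallel` recovers it exactly, which is the invertibility property the flow relies on.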
We draw a random sample \epsilon \sim N(0, I), and initialize the chain with:

z_0 = \mu_0 + \sigma_0 \odot \epsilon    (9)

Figure 2: Like other normalizing flows, drawing samples from an approximate posterior with Inverse Autoregressive Flow (IAF) consists of an initial sample z drawn from a simple distribution, such as a Gaussian with diagonal covariance, followed by a chain of nonlinear invertible transformations of z, each with a simple Jacobian determinant. [Diagram: an encoder NN maps x to \mu, \sigma and h; z = \mu + \sigma \times \epsilon is then passed through a sequence of IAF steps, each an autoregressive NN taking z and h as input.]

The flow consists of a chain of T of the following transformations:

z_t = \mu_t + \sigma_t \odot z_{t-1}    (10)

where at the t-th step of the flow, we use a different autoregressive neural network with inputs z_{t-1} and h, and outputs \mu_t and \sigma_t. The neural network is structured to be autoregressive w.r.t. z_{t-1}, such that for any choice of its parameters, the Jacobians d\mu_t/dz_{t-1} and d\sigma_t/dz_{t-1} are triangular with zeros on the diagonal. As a result, dz_t/dz_{t-1} is triangular with \sigma_t on the diagonal, with determinant \prod_{i=1}^D \sigma_{t,i}. (Note that the Jacobian w.r.t. h does not have constraints.) Following eq. (5), the density under the final iterate is:

log q(z_T|x) = -\sum_{i=1}^D ( (1/2) \epsilon_i^2 + (1/2) log(2\pi) + \sum_{t=0}^T log \sigma_{t,i} )    (11)

The flexibility of the distribution of the final iterate z_T, and its ability to closely fit the true posterior, increases with the expressivity of the autoregressive models and the depth of the chain. See figure 2 for an illustration.

A numerically stable version, inspired by the LSTM-type update, is where we let the autoregressive network output [m_t, s_t], two unconstrained real-valued vectors:

[m_t, s_t] <- AutoregressiveNN[t](z_{t-1}, h; \theta)    (12)

and compute z_t as:

\sigma_t = sigmoid(s_t)    (13)
z_t = \sigma_t \odot z_{t-1} + (1 - \sigma_t) \odot m_t    (14)

This version is shown in algorithm 1. Note that this is just a particular version of the update of eq. (10), so the simple computation of the final log-density of eq. (11) still applies.

We found it beneficial for results to parameterize or initialize the parameters of each AutoregressiveNN[t] such that its outputs s_t are, before optimization, sufficiently positive, such as close to +1 or +2. This leads to an initial behaviour that updates z only slightly with each step of IAF. Such a parameterization is known as a 'forget gate bias' in LSTMs, as investigated by Jozefowicz et al. (2015).

Perhaps the simplest special version of IAF is one with a simple step, and a linear autoregressive model. This transforms a Gaussian variable with diagonal covariance into one with linear dependencies, i.e. a Gaussian distribution with full covariance. See appendix A for an explanation.

Autoregressive neural networks form a rich family of nonlinear transformations for IAF. For non-convolutional models, we use the family of masked autoregressive networks introduced in (Germain et al., 2015) for the autoregressive neural networks.
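A single step of the numerically stable update of eqs. (12)-(14) can be sketched with a linear MADE-style network standing in for AutoregressiveNN[t]. The strictly lower-triangular masking below is an illustrative assumption (the networks used in the paper are deeper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_ar_net(z, W_m, W_s, b_m, b_s):
    # Linear MADE-style autoregressive network (illustrative assumption):
    # strictly lower-triangular masks ensure m_i, s_i depend only on z_{1:i-1}.
    mask = np.tril(np.ones((len(z), len(z))), k=-1)
    m = (W_m * mask) @ z + b_m
    s = (W_s * mask) @ z + b_s
    return m, s

def iaf_step(z, log_q, W_m, W_s, b_m, b_s):
    # One IAF step (eqs. 12-14), accumulating log q as in eq. (11).
    # Initializing b_s near +1 or +2 gives the 'forget gate bias' behaviour.
    m, s = masked_ar_net(z, W_m, W_s, b_m, b_s)
    sigma = sigmoid(s)
    z_new = sigma * z + (1.0 - sigma) * m
    log_q_new = log_q - np.sum(np.log(sigma))
    return z_new, log_q_new
```

Because the Jacobian of the update w.r.t. z is triangular with \sigma_t on the diagonal, subtracting sum(log \sigma_t) implements exactly the density update of eq. (11).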
For CIFAR-10 experiments, which benefit more from scaling to high-dimensional latent space, we use the family of convolutional autoregressive autoencoders introduced by (van den Oord et al., 2016b,c).

We found that results improved when reversing the ordering of the variables after each step in the IAF chain. This is a volume-preserving transformation, so the simple form of eq. (11) remains unchanged.

5 Related work

Inverse autoregressive flow (IAF) is a member of the family of normalizing flows, first discussed in (Rezende and Mohamed, 2015) in the context of stochastic variational inference. In (Rezende and Mohamed, 2015) two specific types of flows are introduced: planar flows and radial flows. These flows are shown to be effective on problems with relatively low-dimensional latent space (at most a few hundred dimensions). It is not clear, however, how to scale such flows to much higher-dimensional latent spaces, such as latent spaces of generative models of larger images, and how planar and radial flows can leverage the topology of latent space, as is possible with IAF. Volume-conserving neural architectures were first presented in (Deco and Brauer, 1995), as a form of nonlinear independent component analysis.

Another type of normalizing flow, introduced by (Dinh et al., 2014) (NICE), uses similar transformations to IAF. In contrast with IAF, this type of transformation updates only half of the latent variables z_{1:D/2} per step, adding a vector f(z_{D/2+1:D}), which is a neural-network-based function of the remaining latent variables z_{D/2+1:D}. Such large blocks have the advantage of a computationally cheap inverse transformation, and the disadvantage of typically requiring longer chains.
In experiments with a low-dimensional latent space, (Rezende and Mohamed, 2015) found that this type of transformation is generally less powerful than other types of normalizing flow. Concurrently to our work, NICE was extended to high-dimensional spaces in (Dinh et al., 2016) (Real NVP). An empirical comparison would be an interesting subject of future research.

A potentially powerful transformation is the Hamiltonian flow used in Hamiltonian Variational Inference (Salimans et al., 2014). Here, a transformation is generated by simulating the flow of a Hamiltonian system consisting of the latent variables z and a set of auxiliary momentum variables. This type of transformation has the additional benefit that it is guided by the exact posterior distribution, and that it leaves this distribution invariant for small step sizes. Such a transformation could thus take us arbitrarily close to the exact posterior distribution if we can apply it a sufficient number of times. In practice, however, Hamiltonian Variational Inference is very demanding computationally. Also, it requires an auxiliary variational bound to account for the auxiliary variables, which can impede progress if the bound is not sufficiently tight.

An alternative method for increasing the flexibility of variational inference is the introduction of auxiliary latent variables (Salimans et al., 2014; Ranganath et al., 2015; Tran et al., 2015) and corresponding auxiliary inference models. Latent variable models with multiple layers of stochastic variables, such as the one used in our experiments, are often equivalent to such auxiliary-variable methods. We combine deep latent variable models with IAF in our experiments, benefiting from both techniques.

6 Experiments

We empirically evaluate IAF by applying the idea to improve variational autoencoders.
Please see appendix C for details on the architectures of the generative model and inference models. Code for reproducing key empirical results is available online^3.

^3 https://github.com/openai/iaf

6.1 MNIST

In this experiment we follow a similar implementation of the convolutional VAE as in (Salimans et al., 2014), with ResNet (He et al., 2015) blocks. A single layer of Gaussian stochastic units of dimension 32 is used. To investigate how the expressiveness of the approximate posterior affects performance, we report results of different IAF posteriors with varying degrees of expressiveness. We use a 2-layer MADE (Germain et al., 2015) to implement one IAF transformation, and we stack multiple IAF transformations with ordering reversed between every other transformation.

Results: Table 1 shows results on MNIST for these types of posteriors. Results indicate that as the approximate posterior becomes more expressive, generative modeling performance becomes better. Also worth noting is that an expressive approximate posterior also tightens variational lower bounds as expected, making the gap between variational lower bounds and marginal likelihoods smaller. By making IAF deep and wide enough, we can achieve the best published log-likelihood on dynamically binarized MNIST: -79.10. On Hugo Larochelle's statically binarized MNIST, our VAE with deep IAF achieves a log-likelihood of -79.88, which is slightly worse than the best reported result, -79.2, using the PixelCNN (van den Oord et al., 2016b).

Table 1: Generative modeling results on the dynamically sampled binarized MNIST version used in previous publications (Burda et al., 2015). Shown are averages; the numbers between brackets are standard deviations across 5 optimization runs. The right column shows an importance sampled estimate of the marginal likelihood for each model with 128 samples. Best previous results are reproduced in the first segment: [1]: (Salimans et al., 2014) [2]: (Burda et al., 2015) [3]: (Kaae Sønderby et al., 2016) [4]: (Tran et al., 2015)

Model                          | VLB             | log p(x) ≈
Convolutional VAE + HVI [1]    | -83.49          | -81.94
DLGM 2hl + IWAE [2]            |                 | -82.90
LVAE [3]                       |                 | -81.74
DRAW + VGP [4]                 | -79.88          |
Diagonal covariance            | -84.08 (± 0.10) | -81.08 (± 0.08)
IAF (Depth = 2, Width = 320)   | -82.02 (± 0.08) | -79.77 (± 0.06)
IAF (Depth = 2, Width = 1920)  | -81.17 (± 0.08) | -79.30 (± 0.08)
IAF (Depth = 4, Width = 1920)  | -80.93 (± 0.09) | -79.17 (± 0.08)
IAF (Depth = 8, Width = 1920)  | -80.80 (± 0.07) | -79.10 (± 0.07)

Figure 3: Overview of our ResNet VAE with bidirectional inference. The posterior of each layer is parameterized by its own IAF. [Diagram: a deep generative model and a bidirectional inference model combine into a VAE with bidirectional inference; bottom-up and top-down ResNet blocks with ELU nonlinearities parameterize the layer prior z ~ p(z_i|z_{>i}) and the layer posterior z ~ q(z_i|z_{>i}, x).]

6.2 CIFAR-10

We also evaluated IAF on the CIFAR-10 dataset of natural images. Natural images contain a much greater variety of patterns and structure than MNIST images; in order to capture this structure well, we experiment with a novel architecture, ResNet VAE, with many layers of stochastic variables, based on residual convolutional networks (ResNets) (He et al., 2015, 2016).
Please see our appendix for details.

Log-likelihood. See table 2 for a comparison to previously reported results. Our architecture with IAF achieves 3.11 bits per dimension, which is better than other published latent-variable models, and almost on par with the best reported result using the PixelCNN. See the appendix for more experimental results. We suspect that the results can be further improved with more steps of flow, which we leave to future work.

Table 2: Our results with ResNet VAEs on CIFAR-10 images, compared to earlier results, in average number of bits per data dimension on the test set. The number for convolutional DRAW is an upper bound, while the ResNet VAE log-likelihood was estimated using importance sampling.

Method                                                | bits/dim ≤
Results with tractable likelihood models:
Uniform distribution (van den Oord et al., 2016b)     | 8.00
Multivariate Gaussian (van den Oord et al., 2016b)    | 4.70
NICE (Dinh et al., 2014)                              | 4.48
Deep GMMs (van den Oord and Schrauwen, 2014)          | 4.00
Real NVP (Dinh et al., 2016)                          | 3.49
PixelRNN (van den Oord et al., 2016b)                 | 3.00
Gated PixelCNN (van den Oord et al., 2016c)           | 3.03
Results with variationally trained latent-variable models:
Deep Diffusion (Sohl-Dickstein et al., 2015)          | 5.40
Convolutional DRAW (Gregor et al., 2016)              | 3.58
ResNet VAE with IAF (Ours)                            | 3.11

Synthesis speed. Sampling took about 0.05 seconds/image with the ResNet VAE model, versus 52.0 seconds/image with the PixelCNN model, on a NVIDIA Titan X GPU. We sampled from the PixelCNN naïvely by sequentially generating a pixel at a time, using the full generative model at each iteration. With custom code that only evaluates the relevant part of the network, PixelCNN sampling could be sped up significantly; however, the speedup will be limited on parallel hardware due to the sequential nature of the sampling operation.
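For reference, the bits/dim figures in table 2 are simply the negative test log-likelihood in nats rescaled by the number of data dimensions (3 × 32 × 32 = 3072 for CIFAR-10) and converted from base e to base 2; a one-line sketch of the conversion:

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    # Average bits per data dimension from a total log-likelihood in nats:
    # negate, divide by the dimension count, and divide by ln 2 (nats -> bits).
    return -log_likelihood_nats / (num_dims * math.log(2))
```

E.g. a CIFAR-10 model at 3.11 bits/dim corresponds to a test log-likelihood of about -3072 * ln(2) * 3.11, roughly -6622 nats per image.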
Ef\ufb01cient sampling from the ResNet VAE is a parallel\ncomputation that does not require custom code.\n\n7 Conclusion\n\nWe presented inverse autoregressive \ufb02ow (IAF), a new type of normalizing \ufb02ow that scales well to\nhigh-dimensional latent space. In experiments we demonstrated that autoregressive \ufb02ow leads to\nsigni\ufb01cant performance gains compared to similar models with factorized Gaussian approximate\nposteriors, and we report close to state-of-the-art log-likelihood results on CIFAR-10, for a model\nthat allows much faster sampling.\n\nAcknowledgements\n\nWe thank Jascha Sohl-Dickstein, Karol Gregor, and many others at Google Deepmind for interesting\ndiscussions. We thank Harri Valpola for referring us to Gustavo Deco\u2019s relevant pioneering work on\na form of inverse autoregressive \ufb02ow applied to nonlinear independent component analysis.\n\nReferences\nBlei, D. M., Jordan, M. I., and Paisley, J. W. (2012). Variational Bayesian inference with Stochastic Search. In\n\nProceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1367\u20131374.\n\nBowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences\n\nfrom a continuous space. arXiv preprint arXiv:1511.06349.\n\nBurda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint\n\narXiv:1509.00519.\n\nClevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by Exponential\n\nLinear Units (ELUs). arXiv preprint arXiv:1511.07289.\n\nDeco, G. and Brauer, W. (1995). Higher order statistical decorrelation without information loss. Advances in\n\nNeural Information Processing Systems, pages 247\u2013254.\n\nDinh, L., Krueger, D., and Bengio, Y. (2014). Nice: non-linear independent components estimation. arXiv\n\npreprint arXiv:1410.8516.\n\nDinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using Real NVP. 
arXiv preprint\n\narXiv:1605.08803.\n\n8\n\n\fGermain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). Made: Masked autoencoder for distribution\n\nestimation. arXiv preprint arXiv:1502.03509.\n\nGregor, K., Besse, F., Rezende, D. J., Danihelka, I., and Wierstra, D. (2016). Towards conceptual compression.\n\narXiv preprint arXiv:1604.08772.\n\nGregor, K., Mnih, A., and Wierstra, D. (2013). Deep AutoRegressive Networks. arXiv preprint arXiv:1310.8499.\n\nHe, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint\n\narXiv:1512.03385.\n\nHe, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. arXiv preprint\n\narXiv:1603.05027.\n\nHoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of\n\nMachine Learning Research, 14(1):1303\u20131347.\n\nJozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical exploration of recurrent network archi-\ntectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages\n2342\u20132350.\n\nKaae S\u00f8nderby, C., Raiko, T., Maal\u00f8e, L., Kaae S\u00f8nderby, S., and Winther, O. (2016). How to train deep\n\nvariational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282.\n\nKingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. Proceedings of the 2nd International\n\nConference on Learning Representations.\n\nRanganath, R., Tran, D., and Blei, D. M. (2015). Hierarchical variational models.\n\narXiv:1511.02386.\n\narXiv preprint\n\nRezende, D. and Mohamed, S. (2015). Variational inference with normalizing \ufb02ows. In Proceedings of The\n\n32nd International Conference on Machine Learning, pages 1530\u20131538.\n\nRezende, D. J., Mohamed, S., and Wierstra, D. (2014). 
Stochastic backpropagation and approximate inference\nIn Proceedings of the 31st International Conference on Machine Learning\n\nin deep generative models.\n(ICML-14), pages 1278\u20131286.\n\nSalimans, T. (2016). A structured variational auto-encoder for learning deep hierarchies of sparse features. arXiv\n\npreprint arXiv:1602.08734.\n\nSalimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training\n\nof deep neural networks. arXiv preprint arXiv:1602.07868.\n\nSalimans, T., Kingma, D. P., and Welling, M. (2014). Markov chain Monte Carlo and variational inference:\n\nBridging the gap. arXiv preprint arXiv:1410.6460.\n\nSohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning\n\nusing nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585.\n\nTran, D., Ranganath, R., and Blei, D. M. (2015). Variational gaussian process. arXiv preprint arXiv:1511.06499.\n\nvan den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.,\nand Kavukcuoglu, K. (2016a). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.\n\nvan den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016b). Pixel recurrent neural networks. arXiv\n\npreprint arXiv:1601.06759.\n\nvan den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. (2016c).\n\nConditional image generation with pixelcnn decoders. arXiv preprint arXiv:1606.05328.\n\nvan den Oord, A. and Schrauwen, B. (2014). Factoring variations in natural images with deep gaussian mixture\n\nmodels. In Advances in Neural Information Processing Systems, pages 3518\u20133526.\n\nZagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.\n\nZeiler, M. D., Krishnan, D., Taylor, G. W., and Fergus, R. (2010). Deconvolutional networks. 
In Computer\n\nVision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528\u20132535. IEEE.\n\n9\n\n\f", "award": [], "sourceid": 2411, "authors": [{"given_name": "Durk", "family_name": "Kingma", "institution": "OpenAI"}, {"given_name": "Tim", "family_name": "Salimans", "institution": "Algoritmica"}, {"given_name": "Rafal", "family_name": "Jozefowicz", "institution": "OpenAI"}, {"given_name": "Xi", "family_name": "Chen", "institution": "UC Berkeley and OpenAI"}, {"given_name": "Ilya", "family_name": "Sutskever", "institution": "Google"}, {"given_name": "Max", "family_name": "Welling", "institution": "University of Amsterdam"}]}