{"title": "Attend, Infer, Repeat: Fast Scene Understanding with Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3225, "page_last": 3233, "abstract": "We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational auto-encoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects - counting, locating and classifying the elements of a scene - without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network at unprecedented speed. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization.", "full_text": "Attend, Infer, Repeat:\n\nFast Scene Understanding with Generative Models\n\nS. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa,\n\nDavid Szepesvari, Koray Kavukcuoglu, Geoffrey E. Hinton\n\n{aeslami,heess,theophane,tassa,dsz,korayk,geoffhinton}@google.com\n\nGoogle DeepMind, London, UK\n\nAbstract\n\nWe present a framework for ef\ufb01cient inference in structured image models that ex-\nplicitly reason about objects. We achieve this by performing probabilistic inference\nusing a recurrent neural network that attends to scene elements and processes them\none at a time. Crucially, the model itself learns to choose the appropriate number\nof inference steps. 
We use this scheme to learn to perform inference in partially\nspeci\ufb01ed 2D models (variable-sized variational auto-encoders) and fully speci\ufb01ed\n3D models (probabilistic renderers). We show that such models learn to identify\nmultiple objects \u2013 counting, locating and classifying the elements of a scene \u2013\nwithout any supervision, e.g., decomposing 3D images with various numbers of\nobjects in a single forward pass of a neural network at unprecedented speed. We\nfurther show that the networks produce accurate inferences when compared to\nsupervised counterparts, and that their structure leads to improved generalization.\n\n1\n\nIntroduction\n\nThe human percept of a visual scene is highly structured. Scenes naturally decompose into objects\nthat are arranged in space, have visual and physical properties, and are in functional relationships with\neach other. Arti\ufb01cial systems that interpret images in this way are desirable, as accurate detection of\nobjects and inference of their attributes is thought to be fundamental for many problems of interest.\nConsider a robot whose task is to clear a table after dinner. To plan its actions it will need to determine\nwhich objects are present, what classes they belong to and where each one is located on the table.\nThe notion of using structured models for image understanding has a long history (e.g., \u2018vision\nas inverse graphics\u2019 [4]), however in practice it has been dif\ufb01cult to de\ufb01ne models that are: (a)\nexpressive enough to capture the complexity of natural scenes, and (b) amenable to tractable inference.\nMeanwhile, advances in deep learning have shown how neural networks can be used to make\nsophisticated predictions from images using little interpretable structure (e.g., [10]). Here we explore\nthe intersection of structured probabilistic models and deep networks. 
Prior work on deep generative\nmethods (e.g., VAEs [9]) have been mostly unstructured, therefore despite producing impressive\nsamples and likelihood scores their representations have lacked interpretable meaning. On the other\nhand, structured generative methods have largely been incompatible with deep learning, and therefore\ninference has been hard and slow (e.g., via MCMC).\nOur proposed framework achieves scene interpretation via learned, amortized inference, and it\nimposes structure on its representation through appropriate partly- or fully-speci\ufb01ed generative\nmodels, rather than supervision from labels. It is important to stress that by training generative\nmodels, the aim is not primarily to obtain good reconstructions, but to produce good representations,\nin other words to understand scenes. We show experimentally that by incorporating the right kinds of\nstructures, our models produce representations that are more useful for downstream tasks than those\nproduced by VAEs or state-of-the-art generative models such as DRAW [3].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fThe proposed framework crucially allows for reasoning about the complexity of a given scene\n(the dimensionality of its latent space). We demonstrate that via an Occam\u2019s razor type effect,\nthis makes it possible to discover the underlying causes of a dataset of images in an unsupervised\nmanner. For instance, the model structure will enforce that a scene is formed by a variable number\nof entities that appear in different locations, but the process of learning will identify what these\nscene elements look like and where they appear in any given image. The framework also combines\nhigh-dimensional distributed representations with directly interpretable latent variables (e.g., af\ufb01ne\npose). 
This combination makes it easier to avoid the pitfalls of models that are too unconstrained (leading to data-hungry learning) or too rigid (leading to failure via mis-specification).\nThe main contributions of the paper are as follows. First, in Sec. 2 we formalize a scheme for efficient variational inference in latent spaces of variable dimensionality. The key idea is to treat inference as an iterative process, implemented as a recurrent neural network that attends to one object at a time, and learns to use an appropriate number of inference steps for each image. We call the proposed framework Attend-Infer-Repeat (AIR). End-to-end learning is enabled by recent advances in amortized variational inference, e.g., combining gradient based optimization for continuous latent variables with black-box optimization for discrete ones. Second, in Sec. 3 we show that AIR allows for learning of generative models that decompose multi-object scenes into their underlying causes, e.g., the constituent objects, in an unsupervised manner. We demonstrate these capabilities on MNIST digits (Sec. 3.1), overlapping sprites and Omniglot glyphs (appendices H and G). We show that model structure can provide an important inductive bias that is not easily learned otherwise, leading to improved generalization. Finally, in Sec. 3.2 we demonstrate how our inference framework can be used to perform inference for a 3D rendering engine with unprecedented speed, recovering the counts, identities and 3D poses of complex objects in scenes with significant occlusion in a single forward pass of a neural network, providing a scalable approach to \u2018vision as inverse graphics\u2019.\n\n2 Approach\n\nIn this paper we take a Bayesian perspective of scene interpretation, namely that of treating this task as inference in a generative model. Thus given an image $x$ and a model $p^x_\\theta(x|z)\\,p^z_\\theta(z)$ parameterized by $\\theta$ we wish to recover the underlying scene description $z$ by computing the posterior $p(z|x) = p^x_\\theta(x|z)\\,p^z_\\theta(z)/p(x)$. In this view, the prior $p^z_\\theta(z)$ captures our assumptions about the underlying scene, and the likelihood $p^x_\\theta(x|z)$ is our model of how a scene description is rendered to form an image. Both can take various forms depending on the problem at hand and we will describe particular instances in Sec. 3. Together, they define the language that we use to describe a scene.\nMany real-world scenes naturally decompose into objects. We therefore make the modeling assumption that the scene description is structured into groups of variables $z^i$, where each group describes the attributes of one of the objects in the scene, e.g., its type, appearance, and pose. Since the number of objects will vary from scene to scene, we assume models of the following form:\n\n$$p_\\theta(x) = \\sum_{n=1}^{N} p_N(n) \\int p^z_\\theta(z|n)\\, p^x_\\theta(x|z)\\, dz. \\qquad (1)$$\n\nThis can be interpreted as follows. We first sample the number of objects $n$ from a suitable prior (for instance a Binomial distribution) with maximum value $N$. The latent, variable length, scene descriptor $z = (z^1, z^2, \\ldots, z^n)$ is then sampled from a scene model $z \\sim p^z_\\theta(\\cdot|n)$. Finally, we render the image according to $x \\sim p^x_\\theta(\\cdot|z)$. Since the indexing of objects is arbitrary, $p^z_\\theta(\\cdot)$ is exchangeable and $p^x_\\theta(x|\\cdot)$ is permutation invariant, and therefore the posterior over $z$ is exchangeable.\nThe prior and likelihood terms can take different forms. We consider two scenarios: For 2D scenes (Sec. 3.1), each object is characterized in terms of a learned distributed continuous representation for its shape, and a continuous 3-dimensional variable for its pose (position and scale). For 3D scenes (Sec. 
3.2), objects are defined in terms of a categorical variable that characterizes their identity, e.g., sphere, cube or cylinder, as well as their positions and rotations. We refer to the two kinds of variables for each object $i$ in both scenarios as $z^i_{\\mathrm{what}}$ and $z^i_{\\mathrm{where}}$ respectively, bearing in mind that their meaning (e.g., position and scale in pixel space vs. position and orientation in 3D space) and their data type (continuous vs. discrete) will vary. We further assume that $z^i$ are independent under the prior, i.e., $p^z_\\theta(z|n) = \\prod_{i=1}^{n} p^z_\\theta(z^i)$, but non-independent priors, such as a distribution over hierarchical scene graphs (e.g., [28]), can also be accommodated. Furthermore, while the number of objects is bounded as per Eq. 1, it is relatively straightforward to relax this assumption.\n\nFigure 1: Left: A single random variable $z$ produces the observation $x$ (the image). The relationship between $z$ and $x$ is specified by a model. Inference is the task of computing likely values of $z$ given $x$. Using an auto-encoding architecture, the model (red arrow) and its inference network (black arrow) can be trained end-to-end via gradient descent. Right: For most images of interest, multiple latent variables (e.g., multiple objects) give rise to the image. We propose an iterative, variable-length inference network (black arrows) that attends to one object at a time, and train it jointly with its model. The result is fast, feed-forward, interpretable scene understanding trained without supervision.\n\n2.1 Inference\nDespite their natural appeal, inference for most models in the form of Eq. 1 is intractable due to the dimensionality of the integral. We therefore employ an amortized variational approximation to the true posterior by learning a distribution $q_\\phi(z, n|x)$ parameterized by $\\phi$ that minimizes $\\mathrm{KL}[q_\\phi(z, n|x) \\,\\|\\, p^z_\\theta(z, n|x)]$. While such approximations have recently been used successfully in a variety of works [21, 9, 18] the specific form of our model poses two additional difficulties. Trans-dimensionality: As a challenging departure from classical latent space models, the size of the latent space $n$ (i.e., the number of objects) is a random variable itself, which necessitates evaluating $p_N(n|x) = \\int p^z_\\theta(z, n|x)\\,dz$, for all $n = 1 \\ldots N$. Symmetry: There are strong symmetries that arise, for instance, from alternative assignments of objects appearing in an image $x$ to latent variables $z^i$.\nWe address these challenges by formulating inference as an iterative process implemented as a recurrent neural network, which infers the attributes of one object at a time. The network is run for $N$ steps and in each step explains one object in the scene, conditioned on the image and on its knowledge of previously explained objects (see Fig. 1).\nTo simplify sequential reasoning about the number of objects, we parameterize $n$ as a variable length latent vector $z_{\\mathrm{pres}}$ using a unary code: for a given value $n$, $z_{\\mathrm{pres}}$ is the vector formed of $n$ ones followed by one zero. Note that the two representations are equivalent. The posterior takes the following form:\n\n$$q_\\phi(z, z_{\\mathrm{pres}}|x) = q_\\phi(z^{n+1}_{\\mathrm{pres}} = 0 \\,|\\, z^{1:n}, x) \\prod_{i=1}^{n} q_\\phi(z^i, z^i_{\\mathrm{pres}} = 1 \\,|\\, x, z^{1:i-1}). \\qquad (2)$$\n\n$q_\\phi$ is implemented as a neural network that, in each step, outputs the parameters of the sampling distributions over the latent variables, e.g., the mean and standard deviation of a Gaussian distribution for continuous variables. 
$z_{\\mathrm{pres}}$ can be understood as an interruption variable: at each time step, if the network outputs $z_{\\mathrm{pres}} = 1$, it describes at least one more object and proceeds, but if it outputs $z_{\\mathrm{pres}} = 0$, no more objects are described, and inference terminates for that particular datapoint.\nNote that conditioning of $z^i | x, z^{1:i-1}$ is critical to capture dependencies between the latent variables $z^i$ in the posterior, e.g., to avoid explaining the same object twice. The specifics of the networks that achieve this depend on the particularities of the models and we will describe them in detail in Sec. 3.\n\n2.2 Learning\nWe can jointly optimize the parameters $\\theta$ of the model and $\\phi$ of the inference network by maximizing the lower bound on the marginal likelihood of an image under the model: $\\log p_\\theta(x) \\geq \\mathcal{L}(\\theta, \\phi) = \\mathbb{E}_{q_\\phi}\\left[\\log \\frac{p_\\theta(x, z, n)}{q_\\phi(z, n|x)}\\right]$ with respect to $\\theta$ and $\\phi$. $\\mathcal{L}$ is called the negative free energy. We provide an outline of how to construct an estimator of the gradient of this quantity below; for more details see [23].\nComputing a Monte Carlo estimate of $\\frac{\\partial}{\\partial\\theta}\\mathcal{L}$ is relatively straightforward: given a sample from the approximate posterior $(z, z_{\\mathrm{pres}}) \\sim q_\\phi(\\cdot|x)$ (i.e., when the latent variables have been \u2018filled in\u2019) we can readily compute $\\frac{\\partial}{\\partial\\theta} \\log p_\\theta(x, z, n)$ provided $p$ is differentiable in $\\theta$.\nComputing a Monte Carlo estimate of $\\frac{\\partial}{\\partial\\phi}\\mathcal{L}$ is more involved. As discussed above, the RNN that implements $q_\\phi$ produces the parameters of the sampling distributions for the scene variables $z$ and presence variables $z_{\\mathrm{pres}}$. For a time step $i$, denote with $\\omega^i$ all the parameters of the sampling distributions of variables in $(z^i_{\\mathrm{pres}}, z^i)$. We parameterize the dependence of this distribution on $z^{1:i-1}$ and $x$ using a recurrent function $R_\\phi(\\cdot)$ implemented as a neural network such that $(\\omega^i, h^i) = R_\\phi(x, h^{i-1})$ with hidden variables $h$. 
The full gradient is obtained via chain rule: $\\partial\\mathcal{L}/\\partial\\phi = \\sum_i \\partial\\mathcal{L}/\\partial\\omega^i \\times \\partial\\omega^i/\\partial\\phi$. Below we explain how to compute $\\partial\\mathcal{L}/\\partial\\omega^i$. We first rewrite our cost function as follows: $\\mathcal{L}(\\theta, \\phi) = \\mathbb{E}_{q_\\phi}[\\ell(\\theta, \\phi, z, n)]$ where $\\ell(\\theta, \\phi, z, n)$ is defined as $\\log \\frac{p_\\theta(x, z, n)}{q_\\phi(z, n|x)}$. Let $z^i$ be an arbitrary element of the vector $(z^i, z^i_{\\mathrm{pres}})$ of type {what, where, pres}. How to proceed depends on whether $z^i$ is continuous or discrete.\n\nContinuous: Suppose $z^i$ is a continuous variable. We use the path-wise estimator (also known as the \u2018re-parameterization trick\u2019, e.g., [9, 23]), which allows us to \u2018back-propagate\u2019 through the random variable $z^i$. For many continuous variables (in fact, without loss of generality), $z^i$ can be sampled as $h(\\xi, \\omega^i)$, where $h$ is a deterministic transformation function, and $\\xi$ a random variable from a fixed noise distribution $p(\\xi)$ giving the gradient estimate: $\\partial\\mathcal{L}/\\partial\\omega^i \\approx \\partial\\ell(\\theta, \\phi, z, n)/\\partial z^i \\times \\partial h/\\partial\\omega^i$.\n\nDiscrete: For discrete scene variables (e.g., $z^i_{\\mathrm{pres}}$) we cannot compute the gradient $\\partial\\mathcal{L}/\\partial\\omega^i_j$ by back-propagation. Instead we use the likelihood ratio estimator [18, 23]. Given a posterior sample $(z, n) \\sim q_\\phi(\\cdot|x)$ we can obtain a Monte Carlo estimate of the gradient: $\\partial\\mathcal{L}/\\partial\\omega^i \\approx \\partial \\log q(z^i|\\omega^i)/\\partial\\omega^i \\, \\ell(\\theta, \\phi, z, n)$. In the raw form presented here this gradient estimate is likely to have high variance. We reduce its variance using appropriately structured neural baselines [18] that are functions of the image and the latent variables produced so far.\n\n3 Models and Experiments\n\nWe first apply AIR to a dataset of multiple MNIST digits, and show that it can reliably learn to detect and generate the constituent digits from scratch (Sec. 3.1). We show that this provides advantages over state-of-the-art generative models such as DRAW [3] in terms of computational effort, generalization to unseen datasets, and the usefulness of the inferred representations for downstream tasks. We also apply AIR to a setting where a 3D renderer is specified in advance. We show that AIR learns to use the renderer to infer the counts, identities and poses of multiple objects in synthetic and real table-top scenes with unprecedented speed (Sec. 3.2 and appendix J).\nDetails of the AIR model and networks used in the 2D experiments are shown in Fig. 2. The generative model (Fig. 2, left) draws $n \\sim \\mathrm{Geom}(\\rho)$ digits $\\{y^i_{\\mathrm{att}}\\}$, scales and shifts them according to $z^i_{\\mathrm{where}} \\sim \\mathcal{N}(0, \\Sigma)$ using spatial transformers, and sums the results $\\{y^i\\}$ to form the image. Each digit is obtained by first sampling a latent code $z^i_{\\mathrm{what}}$ from the prior $z^i_{\\mathrm{what}} \\sim \\mathcal{N}(0, 1)$ and propagating it through a decoder network. The learnable parameters of the generative model are the parameters of this decoder network. The AIR inference network (Fig. 2, middle) produces three sets of variables for each entity at every time-step: a 1-dimensional Bernoulli variable indicating the entity\u2019s presence, a C-dimensional distributed vector describing its class or appearance ($z^i_{\\mathrm{what}}$), and a 3-dimensional vector specifying the affine parameters of its position and scale ($z^i_{\\mathrm{where}}$). Fig. 2 (right) shows the interaction between the inference and generation networks at every time-step. The inferred pose is used to attend to a part of the image (using a spatial transformer) to produce $x^i_{\\mathrm{att}}$, which is processed to produce the inferred code $z^i_{\\mathrm{code}}$ and the reconstruction of the contents of the attention window $y^i_{\\mathrm{att}}$. The same pose information is used by the generative model to transform $y^i_{\\mathrm{att}}$ to obtain $y^i$. This contribution is only added to the canvas $y$ if $z^i_{\\mathrm{pres}}$ was inferred to be true.\nFor the dataset of MNIST digits, we also investigate the behavior of a variant, difference-AIR (DAIR), which employs a slightly different recurrent architecture for the inference network (see Fig. 8 in appendix). As opposed to AIR which computes $z^i$ via $h^i$ and $x$, DAIR reconstructs at every time step $i$ a partial reconstruction $x^i$ of the data $x$, which is set as the mean of the distribution $p^x_\\theta(x|z^1, z^2, \\ldots, z^{i-1})$. We create an error canvas $\\Delta x^i = x^i - x$, and the DAIR inference equation $R_\\phi$ is then specified as $(\\omega^i, h^i) = R_\\phi(\\Delta x^i, h^{i-1})$.\n\nFigure 2: AIR in practice: Left: The assumed generative model. Middle: AIR inference for this model. The contents of the grey box are input to the decoder. Right: Interaction between the inference and generation networks at every time-step. In our experiments the relationship between $x^i_{\\mathrm{att}}$ and $y^i_{\\mathrm{att}}$ is modeled by a VAE, however any generative model of patches could be used (even, e.g., DRAW).\n\n[Figure 3 row labels: Data, 1k, 10k, 200k]\n\nFigure 3: Multi-MNIST learning: Left above: Images from the dataset. Left below: Reconstructions at different stages of training along with a visualization of the model\u2019s attention windows. The 1st, 2nd and 3rd time-steps are displayed using red, green and blue borders respectively. A video of this sequence is provided in the supplementary material. Above right: Count accuracy over time. The model detects the counts of digits accurately, despite having never been provided supervision. Chance accuracy is 25%. Below right: The learned scanning policy for 3 different runs of training (only differing in the random seed). We visualize empirical heatmaps of the attention windows\u2019 positions (red, and green for the first and second time-steps respectively). As expected, the policy is random. 
As expected, the policy is random.\nThis suggests that the policy is spatial, as opposed to identity- or size-based.\n\n3.1 Multi-MNIST\nWe begin with a 50\u21e550 dataset of multi-MNIST digits. Each image contains zero, one or two\nnon-overlapping random MNIST digits with equal probability. The desired goal is to train a network\nthat produces sensible explanations for each of the images. We train AIR with N = 3 on 60,000 such\nimages from scratch, i.e., without a curriculum or any form of supervision by maximizing L with\nrespect to the parameters of the inference network and the generative model. Upon completion of\ntraining we inspect the model\u2019s inferences (see Fig. 3, left). We draw the reader\u2019s attention to the\nfollowing observations. First, the model identi\ufb01es the number of digits correctly, due to the opposing\npressures of (a) wanting to explain the scene, and (b) the cost that arises from instantiating an object\nunder the prior. This is indicated by the number of attention windows in each image; we also plot\nthe accuracy of count inference over the course of training (Fig. 3, above right). Second, it locates\nthe digits accurately. Third, the recurrent network learns a suitable scanning policy to ensure that\ndifferent time-steps account for different digits (Fig. 3, below right). Note that we did not have to\nspecify any such policy in advance, nor did we have to build in a constraint to prevent two time-steps\nfrom explaining the same part of the image. Finally, that the network learns to not use the second\ntime-step when the image contains only a single digit, and to never use the third time-step (images\ncontain a maximum of two digits). 
This allows for the inference network to stop upon encountering the first $z^i_{\\mathrm{pres}}$ equaling 0, leading to potential savings in computation during inference.\nA video showing real-time inference using AIR has been included in the supplementary material.\nWe also perform experiments on Omniglot ([13], appendix G) to demonstrate AIR\u2019s ability to parse glyphs into elements resembling \u2018strokes\u2019, as well as a dataset of sprites where the scene\u2019s elements appear under significant overlap (appendix H). See appendices for details and results.\n\n[Figure 4 row labels: Data, DAIR, Data, DRAW]\n\nFigure 4: Strong generalization: Left: Reconstructions of images with 3 digits made by DAIR trained on 0, 1 or 2 digits, as well as a comparison with DRAW. Right: Variational lower bound, and generalizing / interpolating count accuracy. DAIR out-performs both DRAW and AIR at this task.\n\n3.1.1 Strong Generalization\nSince the model learns the concept of a digit independently of the positions or numbers of times it appears in each image, one would hope that it would be able to generalize, e.g., by demonstrating an understanding of scenes that have structural differences to training scenes. We probe this behavior with the following scenarios: (a) Extrapolation: training on images each containing 0, 1 or 2 digits and then testing on images containing 3 digits, and (b) Interpolation: training on images containing 0, 1 or 3 digits and testing on images containing 2 digits. The result of this experiment is shown in Fig. 4. An AIR model trained on up to 2 digits is effectively unable to infer the correct count when presented with an image of 3 digits. We believe this to be caused by the LSTM which learns during training never to expect more than 2 digits. 
AIR\u2019s generalization performance is improved somewhat\nwhen considering the interpolation task. DAIR by contrast generalizes well in both tasks (and \ufb01nds\ninterpolation to be slightly easier than extrapolation). A closely related baseline is the Deep Recurrent\nAttentive Writer (DRAW, [3]), which like AIR, generates data sequentially. However, DRAW has a\n\ufb01xed and large number of steps (40 in our experiments). As a consequence generative steps do not\ncorrespond to easily interpretable entities, complex scenes are drawn faster and simpler ones slower.\nWe show DRAW\u2019s reconstructions in Fig. 4. Interestingly, DRAW learns to ignore precisely one digit\nin the image. See appendix for further details of these experiments.\n\n3.1.2 Representational Power\nA second motivation for the use of structured\nmodels is that their inferences about a scene\nprovides useful representations for downstream\ntasks. We examine this ability by \ufb01rst train-\ning an AIR model on 0, 1 or 2 digits and then\nproduce inferences for a separate collection of\nimages that contains precisely 2 digits. We split\nthis data into training and test and consider two\ntasks: (a) predicting the sum of the two digits\n(as was done in [1]), and (b) determining if the\ndigits appear in an ascending order. We compare\nwith a CNN trained from the raw pixels, as well\nas interpretations produced by a convolutional\nautoencoder (CAE) and DRAW (Fig. 5). We optimize each model\u2019s hyper-parameters (e.g. depth\nand size) for maximal performance. AIR achieves high accuracy even when data is scarce, indicating\nthe power of its disentangled, structured representation. See appendix for further details.\n\nFigure 5: Representational power: AIR achieves\nhigh accuracy using only a fraction of the labeled\ndata. Left: summing two digits. Right: detecting\nif they appear in increasing order. 
Despite producing comparable reconstructions, CAE and DRAW inferences are less interpretable than AIR\u2019s and therefore lead to poorer downstream performance.\n\n3.2 3D Scenes\nThe experiments above demonstrate learning of inference and generative networks in models where we impose structure in the form of a variable-sized representation and spatial attention mechanisms. We now consider an additional way of imparting knowledge to the system: we specify the generative model via a 3D renderer, i.e., we completely specify how any scene representation is transformed to produce the pixels in an image. Therefore the task is to learn to infer the counts, identities and poses of several objects, given different images containing these objects and an implementation of a 3D renderer from which we can draw new samples. This formulation of computer vision is often called \u2018vision as inverse graphics\u2019 (see e.g., [4, 15, 7]).\n\n[Figure 6 row labels: (a) Data, (b) AIR, (c) Sup., (d) Opt., (e) Data, (f) AIR, (g) Real, (h) AIR]\n\nFigure 6: 3D objects: Left: The task is to infer the identity and pose of a single 3D object. (a) Images from the dataset. (b) Unsupervised AIR reconstructions. (c) Supervised reconstructions. Note poor performance on cubes due to their symmetry. (d) Reconstructions after direct gradient descent. This approach is less stable and much more susceptible to local minima. Right: AIR can learn to recover the counts, identities and poses of multiple objects in a 3D table-top scene. (e,g) Generated and real images. (f,h) AIR produces fast and accurate inferences which we visualize using the renderer.\nThe primary challenge in this view of computer vision is that of inference. 
While it is relatively easy\nto specify high-quality models in the form of probabilistic renderers, posterior inference is either\nextremely expensive or prone to getting stuck in local minima (e.g., via optimization or MCMC). In\naddition, probabilistic renderers (and in particular renderers) typically are not capable of providing\ngradients with respect to their inputs, and 3D scene representations often involve discrete variables,\ne.g., mesh identities. We address these challenges by using \ufb01nite-differencing to obtain a gradient\nthrough the renderer, using the score function estimator to get gradients with respect to discrete\nvariables, and using AIR inference to handle correlated posteriors and variable-length representations.\nWe demonstrate the capabilities of this approach by \ufb01rst considering scenes consisting of only one\nof three objects: a red cube, a blue sphere, and a textured cylinder (see Fig. 6a). Since the scenes\nonly consist of single objects, the task is only to infer the identity (cube, sphere, cylinder) and pose\n(position and rotation) of the object present in the image. We train a single-step (N = 1) AIR\ninference network for this task. The network is only provided with unlabeled images and is trained to\nmaximize the likelihood of those images under the model speci\ufb01ed by the renderer. The quality of\nthe inferred scene representations produced is visually inspected in Fig. 6b. The network accurately\nand reliably infers the identity and pose of the object present in the scene. In contrast, an identical\nnetwork trained to predict the ground-truth identity and pose values of the training data (in a similar\nstyle to [11]) has much more dif\ufb01culty in accurately determining the cube\u2019s orientation (Fig. 6c).\nThe supervised loss forces the network to predict the exact angle of rotation. 
However, this is not identifiable from the image due to rotational symmetry, which leads to conditional probabilities that are multi-modal and difficult to represent using standard network architectures. We also compare with direct optimization of the likelihood from scratch for every test image (Fig. 6d), and observe that this method is slower, less stable and more susceptible to local minima. So not only does amortization reduce the cost of inference, but it also overcomes the pitfalls of independent gradient optimization.\nWe finally consider a more complex setup, where we infer the counts, identities and positions of a variable number of crockery items, as well as the camera position, in a table-top scene. This would be of critical importance to a robot, say, which is tasked with clearing the table. The goal is to learn to perform this task with as little supervision as possible, and indeed we observe that with AIR it is possible to do so with no supervision other than a specification of the renderer. We show reconstructions of AIR\u2019s inferences on generated data, as well as real images of a table with varying numbers of plates, in Fig. 6 and Fig. 7. AIR\u2019s inferences of counts, identities and positions are accurate for the most part. For transfer to real scenes we perform random color and size perturbations to rendered objects during training; however, we note that robust transfer remains a challenging problem in general. We provide a quantitative comparison of AIR\u2019s inference robustness and accuracy on generated scenes with that of a fully supervised network in Fig. 7. We consider two scenarios: one where each object type only appears exactly once, and one where objects can repeat in the scene. A naive supervised setup struggles with object repetitions or when an arbitrary ordering of the objects is imposed by the labels; however, training is more straightforward when there are no repetitions. 
AIR\nachieves competitive reconstruction and counts despite the added dif\ufb01culty of object repetitions.\n\n7\n\n\fFigure 7: 3D scenes details: Left: Ground-truth object and camera positions with inferred positions\noverlayed in red (note that inferred cup is closely aligned with ground-truth, thus not clearly visible).\nWe demonstrate fast inference of all relevant scene elements using the AIR framework. Middle: AIR\nproduces signi\ufb01cantly better reconstructions and count accuracies than a supervised method on data\nthat contains repetitions, and is even competitive on simpler data. Right: Heatmap of object locations\nat each time-step (top). The learned policy appears to be more dependent on identity (bottom).\n\n4 Related Work\n\nDeep neural networks have had great success in learning to predict various quantities from images,\ne.g., object classes [10], camera positions [8] and actions [20]. These methods work best when\nlarge labeled datasets are available for training. At the other end of the spectrum, e.g., in \u2018vision\nas inverse graphics\u2019, only a generative model is speci\ufb01ed in advance and prediction is treated as an\ninference problem, which is then solved using MCMC or message passing at test-time. These models\nrange from highly speci\ufb01ed [17, 16], to partially speci\ufb01ed [28, 24, 25], to largely unspeci\ufb01ed [22].\nInference is very challenging and almost always the bottle-neck in model design.\nSeveral works exploit data-driven predictions to empower the \u2018vision as inverse graphics\u2019 paradigm\n[5, 7]. For instance, in PICTURE [11], the authors use a deep network to distill the results of slow\nMCMC, speeding up predictions at test-time. 
Variational auto-encoders [21, 9] and their discrete\ncounterparts [18] made the important contribution of showing how the gradient computations for\nlearning of amortized inference and generative models could be interleaved, allowing both to be\nlearned simultaneously in an end-to-end fashion (see also [23]). Works like that of [12] aim to learn\ndisentangled representations in an auto-encoding framework using special network structures and / or\ncareful training schemes. It is also worth noting that attention mechanisms in neural networks have\nbeen studied in discriminative and generative settings, e.g., [19, 6, 3].\nAIR draws upon, extends and links these ideas. By its nature AIR is also related to the following\nproblems: counting [14, 27], pondering [2], and gradient estimation through renderers [15]. It is the\ncombination of these elements that unlocks the full capabilities of the proposed approach.\n\n5 Discussion\n\nIn this paper our aim has been to learn unsupervised models that are good at scene understanding, in\naddition to scene reconstruction. We presented several principled models that learn to count, locate,\nclassify and reconstruct the elements of a scene, and do so in a fraction of a second at test-time. The\nmain ingredients are (a) building in meaning using appropriate structure, (b) amortized inference that\nis attentive, iterative and variable-length, and (c) end-to-end learning.\nWe demonstrated that model structure can provide an important inductive bias that gives rise to\ninterpretable representations that are not easily learned otherwise. We also showed that even for\nsophisticated models or renderers, fast inference is possible. 
We do not claim to have found an ideal\nmodel for all images; many challenges remain, e.g., the dif\ufb01culty of working with the reconstruction\nloss and that of designing models rich enough to capture all natural factors of variability.\nLearning in AIR is most successful when the variance of the gradients is low and the likelihood is well\nsuited to the data. It will be of interest to examine the scaling of variance with the number of objects\nand alternative likelihoods. It is straightforward to extend the framework to semi- or fully-supervised\nsettings. Furthermore, the framework admits a plug-and-play approach where existing state-of-the-art\ndetectors, classi\ufb01ers and renderers are used as sub-components of an AIR inference network. We\nplan to investigate these lines of research in future work.\n\nReferences\n[1] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple Object Recognition with Visual Attention. In ICLR, 2015.\n[2] Alex Graves. Adaptive computation time for recurrent neural networks. abs/1603.08983, 2016.\n[3] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. DRAW: A Recurrent Neural Network For Image Generation. In ICML, 2015.\n[4] Ulf Grenander. Pattern Synthesis: Lectures in Pattern Theory. 1976.\n[5] Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The \"wake-sleep\" algorithm for unsupervised neural networks. Science, 268(5214), 1995.\n[6] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial Transformer Networks. 2015.\n[7] Varun Jampani, Sebastian Nowozin, Matthew Loper, and Peter V. Gehler. The Informed Sampler: A Discriminative Approach to Bayesian Inference in Generative Computer Vision Models. CVIU, 2015.\n[8] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, 2015.\n[9] Diederik P Kingma and Max Welling. 
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.\n[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classi\ufb01cation with Deep Convolutional Neural Networks. In NIPS 25, 2012.\n[11] Tejas D. Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum, and Vikash K. Mansinghka. Picture: A probabilistic programming language for scene perception. In CVPR, 2015.\n[12] Tejas D Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep Convolutional Inverse Graphics Network. In NIPS 28, 2015.\n[13] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266), 2015.\n[14] Victor Lempitsky and Andrew Zisserman. Learning To Count Objects in Images. In NIPS 23, 2010.\n[15] Matthew M. Loper and Michael J. Black. OpenDR: An Approximate Differentiable Renderer. In ECCV, volume 8695, 2014.\n[16] Vikash Mansinghka, Tejas Kulkarni, Yura Perov, and Josh Tenenbaum. Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs. In NIPS 26, 2013.\n[17] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. BLOG: Probabilistic Models with Unknown Objects. In International Joint Conference on Arti\ufb01cial Intelligence, pages 1352\u20131359, 2005.\n[18] Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning. In ICML, 2014.\n[19] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent Models of Visual Attention. In NIPS 27, 2014.\n[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 
Human-level control through deep reinforcement learning. Nature, 518, 2015.\n[21] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, 2014.\n[22] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, 2009.\n[23] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient Estimation Using Stochastic Computation Graphs. In NIPS 28, 2015.\n[24] Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Tensor Analyzers. In ICML, 2013.\n[25] Yichuan Tang, Nitish Srivastava, and Ruslan Salakhutdinov. Learning Generative Models With Visual Attention. In NIPS 27, 2014.\n[26] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In ICIRS, 2012.\n[27] Jianming Zhang, Shuga Ma, Mehrnoosh Sameki, Stan Sclaroff, Margrit Betke, Zhe Lin, Xiaohui Shen, Brian Price, and Radom\u00edr M\u011bch. Salient Object Subitizing. In CVPR, 2015.\n[28] Song-Chun Zhu and David Mumford. A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2(4), 2006.", "award": [], "sourceid": 1612, "authors": [{"given_name": "S. M. Ali", "family_name": "Eslami", "institution": "Google DeepMind"}, {"given_name": "Nicolas", "family_name": "Heess", "institution": "Google DeepMind"}, {"given_name": "Theophane", "family_name": "Weber", "institution": "Google DeepMind"}, {"given_name": "Yuval", "family_name": "Tassa", "institution": "Google DeepMind"}, {"given_name": "David", "family_name": "Szepesvari", "institution": "Google DeepMind"}, {"given_name": "koray", "family_name": "kavukcuoglu", "institution": "Google DeepMind"}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": "Google"}]}