{"title": "Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 99, "abstract": "We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose a novel approach which models future frames in a probabilistic manner. Our proposed method is therefore able to synthesize multiple possible next frames using the same model. Solving this challenging problem involves low- and high-level image and motion understanding for successful image synthesis. Here, we propose a novel network structure, namely a Cross Convolutional Network, that encodes images as feature maps and motion information as convolutional kernels to aid in synthesizing future frames. In experiments, our model performs well both on synthetic data, such as 2D shapes and animated game sprites, and on real-world video data. We show that our model can also be applied to tasks such as visual analogy-making, and present an analysis of the learned network representations.", "full_text": "Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

Tianfan Xue*1  Jiajun Wu*1  Katherine L. Bouman1  William T. Freeman1,2
{tfxue, jiajunwu, klbouman, billf}@mit.edu
1 Massachusetts Institute of Technology  2 Google Research

Abstract
We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose to model future frames in a probabilistic manner.
Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. To synthesize realistic movement of objects, we propose a novel network structure, namely a Cross Convolutional Network; this network encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, as well as on real-world video frames. We also show that our model can be applied to visual analogy-making, and present an analysis of the learned network representations.

1 Introduction
From just a single snapshot, humans are often able to imagine how a scene will visually change over time. For instance, due to the pose of the girl in Figure 1, most would predict that her arms are stationary but her leg is moving. However, the exact motion is often unpredictable due to an intrinsic ambiguity. Is the girl's leg moving up or down? In this work, we study the problem of visual dynamics: modeling the conditional distribution of future frames given an observed image. We propose to tackle this problem using a probabilistic, content-aware motion prediction model that learns this distribution without using annotations. Sampling from this model allows us to visualize the many possible ways that an input image is likely to change over time.

Modeling the conditional distribution of future frames given only a single image as input is a very challenging task for a number of reasons. First, natural images come from a very high-dimensional distribution that is difficult to model. Designing a generative model for realistic images is a very challenging problem.
Second, in order to properly predict motion distributions, the model must first learn about image parts and the correlation of their respective motions in an unsupervised fashion.

In this work, we tackle the visual dynamics problem using a neural network structure, based on a variational autoencoder [Kingma and Welling, 2014] and our newly proposed cross convolutional layer. During training, the network observes a set of consecutive image pairs in videos, and automatically infers the relationship between them without any supervision. During testing, the network then predicts the conditional distribution, P(J|I), of future RGB images J (Figure 1b) given an RGB input image I that was not in the training set (Figure 1a). Using this distribution, the network is able to synthesize multiple different image samples corresponding to possible future frames of the input image (Figure 1c).

* indicates equal contributions.

Figure 1: Predicting the movement of an object from a single snapshot is often ambiguous. For instance, is the girl's leg in (a) moving up or down? We propose a probabilistic, content-aware motion prediction model (b) that learns the conditional distribution of future frames. Using this model we are able to synthesize various future frames (c) that are all consistent with the observed input (a).

Our network contains a number of key components that contribute to its success:
• We use a conditional variational autoencoder to model the complex conditional distribution of future frames [Kingma and Welling, 2014, Yan et al., 2016]. This allows us to approximate a sample, J, from the distribution of future images by using a trainable function J = f(I, z). The argument z is a sample from a simple distribution, e.g. Gaussian, which introduces randomness into the sampling of J.
This formulation makes the problem of learning the distribution much more tractable than explicitly modeling the distribution.
• We model motion using a set of image-dependent convolution kernels operating over an image pyramid. Unlike normal convolutional layers, these kernels vary between images, as different images may have different motions. Our proposed cross convolutional layer convolves image-dependent kernels with feature maps from an observed frame, to synthesize a probable future frame.

We test the proposed model on two synthetic datasets as well as a dataset generated from real videos. We show that, given an RGB input image, the algorithm can successfully model a distribution of possible future frames, and generate different samples that cover a variety of realistic motions. In addition, we demonstrate that our model can be easily applied to tasks such as visual analogy-making, and present an analysis of the learned network representations.

2 Related Work
Motion priors  Research studying the human visual system and motion priors provides evidence for low-level statistics of object motion. Pioneering work by Weiss and Adelson [1998] found that the human visual system prefers slow and smooth motion fields. More recent work by Roth and Black [2005] analyzed the response of spatial filters applied to optical flow fields. Fleet et al. [2000] also found that a local motion field can be represented by a linear combination of a small number of bases. All these works focus on the distribution of a motion field itself without considering any image information. In contrast, our context-aware model captures the relationship between an observed image and its motion field.

Motion or future prediction  Our problem is closely related to the motion or feature prediction problem.
Given an observed image or a short video sequence, models have been proposed to predict a future motion field [Liu et al., 2011, Pintea et al., 2014, Xue et al., 2014, Walker et al., 2015, 2016], a future trajectory of objects [Walker et al., 2014, Wu et al., 2015], or a future visual representation [Vondrick et al., 2016b]. Most of these works use deterministic prediction models [Pintea et al., 2014, Vondrick et al., 2016b]. Recently, and concurrently with our own work, Walker et al. [2016] found that there is an intrinsic ambiguity in deterministic prediction, and proposed a probabilistic prediction framework. Our model is also a probabilistic prediction model, but it directly predicts pixel values, rather than motion fields or image features.

Parametric image synthesis  Early work in parametric image synthesis mostly focused on texture synthesis using hand-crafted features [Portilla and Simoncelli, 2000]. More recently, works in image synthesis have begun to produce impressive results by training variants of neural network structures to produce novel images [Gregor et al., 2015, Xie et al., 2016a,b, Zhou et al., 2016]. Generative adversarial networks [Goodfellow et al., 2014, Denton et al., 2015, Radford et al., 2016] and variational autoencoders [Kingma and Welling, 2014, Yan et al., 2016] have been used to model and sample from natural image distributions. Our proposed algorithm is also based on the variational autoencoder, but unlike in this previous work, we also model temporal consistency.

Video synthesis  Techniques that exploit the periodic structure of motion in videos have also been successful at generating novel frames from an input sequence. Early work in video textures proposed to shuffle frames from an existing video to generate a temporally consistent, looping image sequence [Schödl et al., 2000].
These ideas were later extended to generate cinemagraphs [Joshi et al., 2012], seamlessly looping videos containing a variety of objects with different motion patterns [Agarwala et al., 2005, Liao et al., 2013], or video inpainting [Wexler et al., 2004]. While high-resolution and realistic-looking videos are generated using these techniques, they are often limited to periodic motion and require an input reference video. In contrast, we build an image generation model that does not require a reference video at test time.

Figure 2: A toy world example. See Section 3.2 for details.

Recently, several network structures have been proposed to synthesize a new frame from observed frames. They infer the future motion either from multiple previous frames [Srivastava et al., 2015, Mathieu et al., 2016], from user-supplied action labels [Oh et al., 2015, Finn et al., 2016], or from a random vector [Vondrick et al., 2016a]. In contrast to these approaches, our network takes a single frame as input and learns the distribution of future frames without any supervision.

3 Formulation
3.1 Problem Definition
In this section, we describe how to sample future frames from a current observation image. Here we focus on next frame synthesis; given an RGB image I observed at time t, our goal is to model the conditional distribution of possible frames observed at time t + 1.

Formally, let {(I(1), J(1)), . . . , (I(n), J(n))} be the set of image pairs in the training set, where I(i) and J(i) are images observed at two consecutive time steps. Using this data, our task is to model the distribution pθ(J|I) of all possible next frames J for a new, previously unseen test image I, and then to sample new images from this distribution.
In practice, we choose not to directly predict the next frame, but instead to predict the difference image v = J − I, also known as the Eulerian motion, between the observed frame I and the future frame J; these two problems are equivalent. The task is then to learn the conditional distribution pθ(v|I) from a set of training pairs {(I(1), v(1)), . . . , (I(n), v(n))}.

3.2 A Toy Example
Consider a simple toy world that only consists of circles and squares. All circles move vertically, while all squares move horizontally, as shown in Figure 2(a). Although in practice we choose v to be the difference image between consecutive frames, for this toy example we show v as a 2D motion field for a more intuitive visualization. Consider the three models shown in Figure 2.

(1) Deterministic motion prediction  In this structure, the model tries to find a deterministic relationship between the input image and object motion (Figure 2(b)). To do this, it attempts to find a function f that minimizes the reconstruction error ∑i ||v(i) − f(I(i))|| on a training set. Thus, it cannot capture the multiple possible motions that a shape can have, and the algorithm can only learn a mean motion for each object. In the case of zero-mean, symmetric motion distributions, the algorithm would produce an output frame with almost no motion.

(2) Motion prior  A simple way to model the multiple possible motions of future frames is to use a variational autoencoder [Kingma and Welling, 2014], as shown in Figure 2(c). The network consists of an encoder network (gray) and a decoder network (yellow), and the latent representation z encodes the intrinsic dimensionality of the motion fields. A shortcoming of this model is that it does not see the input image during inference.
Therefore, it will only learn a global motion field of both circles and squares, without distinguishing the particular motion pattern for each class of objects.

(3) Probabilistic frame predictor  In this work, we combine the deterministic motion prediction structure with a motion prior, to model the uncertainty in a motion field and the correlation between motion and image content. We extend the decoder in (2) to take two inputs, the intrinsic motion representation z and an image I (see the yellow network in Figure 2(d), which corresponds to p(v|I, z)). Therefore, instead of modeling a joint distribution of motion v, it will learn a conditional distribution of motion given the input image I.

[Figure 2 panels: (a) a toy world; (b) deterministic motion prediction; (c) motion prior; (d) probabilistic frame predictor.]

In this toy example, since squares and circles only move in one (although different) direction, we would only need a scalar z ∈ R for encoding the velocity of the object. The model is then able to infer the location and direction of motion conditioned on the shape that appears in the input image.

3.3 Conditional Variational Autoencoder
In this section, we formally derive the training objective of our model, following similar derivations to those in Kingma and Welling [2014], Kingma et al. [2014], Yan et al. [2016].
Consider the following generative process that samples a future frame conditioned on an observed image, I. First, the algorithm samples the hidden variable z from a prior distribution pz(z); in this work, we assume pz(z) is a multivariate Gaussian distribution where each dimension is i.i.d. with zero mean and unit variance. Then, given a value of z, the algorithm samples the intensity difference image v from the conditional distribution pθ(v|I, z). The final image, J = I + v, is then returned as output.

In the training stage, the algorithm attempts to maximize the log-likelihood of the conditional marginal distribution, ∑i log p(v(i)|I(i)). Assuming I and z are independent, the marginal distribution is expanded as ∑i log ∫z p(v(i)|I(i), z) pz(z) dz. Directly maximizing this marginal likelihood is hard; thus we instead maximize its variational lower bound, as proposed by Kingma and Welling [2014]. Each term of the marginal log-likelihood is lower-bounded by the estimate

L(θ, φ, v(i)|I(i)) ≈ −DKL(qφ(z|v(i), I(i)) || pz(z)) + (1/L) ∑l=1..L log pθ(v(i)|z(i,l), I(i)),   (1)

where DKL is the KL-divergence, qφ(z|v(i), I(i)) is the variational distribution that approximates the posterior p(z|v(i), I(i)), and z(i,l) are samples from the variational distribution.
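Under the Gaussian choices made for the generative and recognition models below, both terms of this bound have simple forms: a closed-form KL divergence between a diagonal Gaussian posterior and the unit Gaussian prior, and a reconstruction term proportional to a squared error. The following numpy sketch of a single-pair, single-sample estimate is our own illustration, not the authors' implementation; the function names and shapes are hypothetical.

```python
import numpy as np

def kl_to_unit_gaussian(mu, var):
    # D_KL( N(mu, diag(var)) || N(0, I) ), summed over latent dimensions
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def elbo_single_pair(v, I, g_mean, g_var, f_mean, sigma=1.0, L=1, rng=None):
    """Monte Carlo estimate of the bound in Eq. (1) for one training pair.

    g_mean, g_var: recognition nets for q_phi(z | v, I), returning mean / variance
    f_mean:        generative net producing the predicted difference image
    sigma:         shared output variance of the generative model
    """
    rng = rng or np.random.default_rng(0)
    mu, var = g_mean(v, I), g_var(v, I)
    total = -kl_to_unit_gaussian(mu, var)
    for _ in range(L):
        # reparameterization trick: z = mu + sqrt(var) * eps, eps ~ N(0, I)
        z = mu + np.sqrt(var) * rng.standard_normal(mu.shape)
        # Gaussian log-likelihood log p(v | z, I), up to an additive constant
        total += -np.sum((v - f_mean(z, I))**2) / (2.0 * sigma**2) / L
    return total
```

Writing the sampling step as a deterministic function of (mu, var) and a unit-Gaussian draw is what keeps the bound differentiable with respect to the recognition network's parameters.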
For simplicity, we refer to the conditional data distribution, pθ(·), as the generative model, and the variational distribution, qφ(·), as the recognition model.

We assume Gaussian distributions for both the generative model and the recognition model*, where the mean and variance of the distributions are functions specified by neural networks, that is†:

pθ(v(i)|z(i,l), I(i)) = N(v(i); fmean(z(i,l), I(i)), σ²I),   (2)
qφ(z(i,l)|v(i), I(i)) = N(z(i,l); gmean(v(i), I(i)), gvar(v(i), I(i))),   (3)

where N(·; a, b) is a Gaussian distribution with mean a and variance b. fmean is a function that predicts the mean of the generative model, defined by the generative network (the yellow network in Figure 2(d)). gmean and gvar are functions that predict the mean and variance of the recognition model, respectively, defined by the recognition network (the gray network in Figure 2(d)). Here we assume that all dimensions of the generative model have the same variance σ², where σ is a hand-tuned hyperparameter. In the next section, we will describe the details of both network structures.

4 Method
In this section we present a trainable neural network structure, which defines the generative function fmean and the recognition functions gmean and gvar. Once trained, these functions can be used in conjunction with an input image to sample future frames. We first describe our newly proposed cross convolutional layer, which naturally characterizes a layered motion representation [Wang and Adelson, 1993]. We then explain our network structure and demonstrate how we integrate the cross convolutional layer into the network for future frame synthesis.

4.1 Layered Motion Representations and Cross Convolutional Networks
Motion can often be decomposed in a layer-wise manner [Wang and Adelson, 1993].
Intuitively, different semantic segments in an image should have different distributions over all possible motions; for example, a building is often static, but a river flows.

To model layered motion, we propose a novel cross convolutional network (Figure 3). The network first decomposes an input image pyramid into multiple feature maps through an image encoder (Figure 3(c)). It then convolves these maps with different kernels (Figure 3(d)), and uses the outputs to synthesize a difference image (Figure 3(e)). This network structure naturally fits a layered motion representation, as each feature map characterizes an image layer (note this is different from a network layer) and the corresponding kernel characterizes the motion of that layer. In other words, we model motions as convolutional kernels, which are applied to feature maps of images at multiple scales.

*A complicated distribution can be approximated by a function of a simple distribution, e.g. Gaussian, which is referred to as the reparameterization trick in [Kingma and Welling, 2014].
†Here the bold I denotes an identity matrix, whereas the normal-font I denotes the observed image.

Figure 3: Our network consists of five components: (a) a motion encoder, (b) a kernel decoder, (c) an image encoder, (d) a cross convolution layer, and (e) a motion decoder. Our image encoder takes images at four scales as input. For simplicity, we only show two scales in this figure.

Unlike a traditional convolutional network, these kernels should not be identical for all inputs, as different images typically have different motions (kernels). We therefore propose a cross convolutional layer to tackle this problem. The cross convolutional layer does not learn the weights of the kernels itself.
Instead, it takes both kernel weights and feature maps as input and performs convolution during a forward pass; for back-propagation, it computes the gradients of both the convolutional kernels and the feature maps. Concurrent work by Finn et al. [2016] and Brabandere et al. [2016] also explored similar ideas. While they applied the learned kernels to input images, we jointly learn feature maps and kernels without direct supervision.

4.2 Network Structure
As shown in Figure 3, our network consists of five components: (a) a motion encoder, (b) a kernel decoder, (c) an image encoder, (d) a cross convolutional layer, and (e) a motion decoder. The recognition functions gmean and gvar are defined by the motion encoder, whereas the generative function fmean is defined by the remaining network.

During training, our variational motion encoder (Figure 3(a)) takes two adjacent frames in time as input, both at a resolution of 128 × 128, and outputs a 3,200-dimensional mean vector and a 3,200-dimensional variance vector. The network samples the latent motion representation z using these mean and variance vectors. Next, the kernel decoder (Figure 3(b)) sends the 3,200 = 128 × 5 × 5 tensor into two additional convolutional layers, producing four sets of 32 motion kernels of size 5 × 5. Our image encoder (Figure 3(c)) operates on four different scaled versions of the input image I (256 × 256, 128 × 128, 64 × 64, and 32 × 32). The output sizes of the feature maps in these four channels are 32 × 64 × 64, 32 × 32 × 32, 32 × 16 × 16, and 32 × 8 × 8, respectively. This multi-scale convolutional network allows us to model both global and local structures in the image, which may have different motions.
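The forward pass of the cross convolutional layer pairs each feature map with its own image-dependent kernel. Below is a simplified single-scale numpy sketch of that pairing (our own illustration, not the authors' implementation; "convolution" here is the usual deep-learning cross-correlation, and shapes are illustrative):

```python
import numpy as np

def cross_convolve(feature_maps, kernels):
    """Convolve each feature map with its own image-dependent kernel.

    feature_maps: (C, H, W) maps from the image encoder
    kernels:      (C, k, k) per-map kernels produced by the kernel decoder
    Returns an array of shape (C, H, W), using 'same' zero padding.
    """
    C, H, W = feature_maps.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(feature_maps, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(feature_maps)
    for c in range(C):                    # one kernel per map (per motion layer)
        for i in range(H):
            for j in range(W):
                patch = padded[c, i:i + k, j:j + k]
                out[c, i, j] = np.sum(patch * kernels[c])
    return out
```

In the full model there would be one such operation per scale, each convolving 32 feature maps with 32 learned 5 × 5 kernels; unlike an ordinary convolutional layer, the kernels are themselves a network output and so differ from image to image.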
See appendix for more details.

The core of our network is a cross convolutional layer (Figure 3(d)) which, as discussed in Section 4.1, applies the kernels learned by the kernel decoder to the corresponding feature maps learned by the image encoder. The output size of the cross convolutional layer is identical to that of the image encoder. Finally, our motion decoder (Figure 3(e)) uses the output of the cross convolutional layer to regress the output difference image.

Training and testing details  During training, the image encoder takes a single frame I(i) as input, and the motion encoder takes both I(i) and the difference image v(i) = J(i) − I(i) as input, where J(i) is the next frame. The network aims to regress the difference image that minimizes the ℓ2 loss. During testing, the image encoder still sees a single image I; however, instead of using a motion encoder, we directly sample motion vectors z(j) from the prior distribution pz(z). In practice, we use an empirical distribution of z over all training samples as an approximation to the prior, as we find it produces better synthesis results. The network synthesizes possible difference images v(j) by taking the sampled latent representation z(j) and an RGB image I as input. We then generate a set of future frames {J(j)} from these difference images: J(j) = I + v(j).

Figure 4: Results on the shapes dataset containing circles (C), squares (S), and triangles (T). For each 'Frame 2' we show the RGB image along with an overlay of green and magenta versions of the two consecutive frames, to help illustrate motion. See text and our project page for more details and a better visualization.

Figure 5: Left: for each object, comparison between its ground-truth motion distribution and the distribution predicted by our method. Right: KL divergence (DKL(pgt || ppred)) between ground-truth distributions and distributions predicted by three different algorithms:

Method | C.   | S.    | T.    | C.-T.
Flow   | 6.77 | 7.07  | 6.07  | 8.42
AE     | 8.76 | 12.37 | 10.36 | 10.58
Ours   | 1.70 | 2.48  | 1.14  | 2.46

5 Evaluations
We now present a series of experiments to evaluate our method. All experimental results, along with additional visualizations, are also available on our project page‡.

Movement of 2D shapes  We first evaluate our method using a dataset of synthetic 2D shapes. This dataset serves to benchmark our model on objects with simple, yet nontrivial, motion distributions. It contains three types of objects: circles, squares, and triangles. Circles always move vertically, squares horizontally, and triangles diagonally. The motions of circles and squares are independent, while the motions of circles and triangles are correlated. The shapes can be heavily occluded, and their sizes, positions, and colors are chosen randomly. There are 20,000 pairs for training, and 500 for testing.

Results are shown in Figure 4. Figure 4(a) and (b) show a sample of consecutive frames in the dataset, and Figure 4(c) shows the reconstruction of the second frame after encoding and decoding with the ground truth images. Figure 4(d) and (e) show samples of the second frame; in these results the network only takes the first image as input, and the compact motion representation, z, is randomly sampled.
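The test-time procedure used to generate these samples (Section 4.2) is simple: draw z from the empirical prior, synthesize a difference image, and add it back to the input. A minimal sketch, where the hypothetical `synthesize_v` stands in for the whole generative path (kernel decoder, cross convolution, and motion decoder):

```python
import numpy as np

def sample_future_frames(I, synthesize_v, z_bank, n_samples=2, rng=None):
    """Sample possible next frames for a single input image.

    synthesize_v: (I, z) -> difference image v (the generative network)
    z_bank:       latent vectors collected from training pairs, used as an
                  empirical approximation to the prior p_z(z)
    """
    rng = rng or np.random.default_rng(0)
    frames = []
    for _ in range(n_samples):
        z = z_bank[rng.integers(len(z_bank))]  # draw from the empirical prior
        frames.append(I + synthesize_v(I, z))  # J = I + v (Eulerian motion)
    return frames
```

Each call with a fresh draw of z yields a different but plausible next frame for the same input.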
Note that the network is able to capture the distinctive motion pattern for each shape, including the strong correlation of triangle and circle motion.

To quantitatively evaluate our algorithm, we compare the displacement distributions of circles, squares, and triangles in the sampled images with their ground truth distributions. We sampled 50,000 images and used the optical flow package by Liu [2009] to calculate the movement of each object. We compare our algorithm with a simple baseline that copies the optical flow field from the training set ('Flow' in Figure 5 right); for each test image, we find its 10 nearest neighbors in the training set, and randomly transfer one of the corresponding optical flow fields. To illustrate the advantage of using a variational autoencoder over a standard autoencoder, we also modify our network by removing the KL-divergence loss and sampling layer ('AE' in Figure 5 right). Figure 5 shows our predicted distribution is very close to the ground-truth distribution. It also shows that a variational autoencoder helps to capture the true distribution of future frames.

‡Our project page: http://visualdynamics.csail.mit.edu

Labeled real (%):
Method | 32×32 | 64×64
Flow   | 29.7  | 21.0
Ours   | 41.2  | 35.7

Figure 6: Left: Sampling results on the Sprites dataset. Motion is illustrated using the overlay described in Figure 4.
Right: probability that a synthesized result is labeled as real by humans in Mechanical Turk behavioral experiments.

Labeled real (%):
Method | 32×32 | 64×64
Flow   | 31.3  | 25.5
Ours   | 36.7  | 31.3

Figure 7: Results on the Exercise dataset. Left: sampling results, with motion illustrated using the overlay described in Figure 4. Right: probability that a synthesized result is labeled as real by humans in Mechanical Turk behavioral experiments.

Movement of video game sprites  We evaluate our framework on a video game sprites dataset§, also used by Reed et al. [2015]. The dataset consists of 672 unique characters, and for each character there are 5 animations (spellcast, thrust, walk, slash, shoot) from 4 different viewpoints. Each animation ranges from 6 to 13 frames. We collect 102,364 pairs of neighboring frames for training, and 3,140 pairs for testing. The same character does not appear in both the training and test sets. Synthesized sample frames are shown in Figure 6. The results show that from a single input frame, our method can capture various possible motions that are consistent with those in the training set.

For a quantitative evaluation, we conduct behavioral experiments on Amazon Mechanical Turk. We randomly select 200 images, sample possible next frames using our algorithm, and show them to multiple human subjects as an animation, side by side with the ground truth animation. We then ask the subjects to choose which animation is real (not synthesized). An ideal algorithm should achieve a success rate of 50%. In our experiments, we present the animation at both the original resolution (64 × 64) and a lower resolution (32 × 32). We only evaluate on subjects that have a past approval rating of > 95% and also pass our qualification tests.
Figure 6 shows that our algorithm significantly outperforms a baseline algorithm that warps an input image by transferring a randomly selected flow field from the training set. Subjects are more easily fooled by the 32 × 32 pixel images, as it is harder to hallucinate realistic details in high-resolution images.

Movement in real videos captured in the wild  To demonstrate that our algorithm can also handle real videos, we collect 20 workout videos from YouTube, each about 30 to 60 minutes long. We first apply motion stabilization to the training data as a pre-processing step to remove camera motion. We then extract 56,838 pairs of frames for training and 6,243 pairs for testing. The training and testing pairs come from different video sequences. Figure 7 shows that our framework works well in predicting the movement of the legs and torso. Additionally, Mechanical Turk behavioral experiments show that the synthesized frames are visually realistic.

Zero-shot visual analogy-making  Recently, Reed et al. [2015] studied the problem of inferring the relationship between a pair of reference images and synthesizing a new analogy-image by applying the inferred relationship to a test image. Our network is also able to perform this task, without even requiring supervision. Specifically, we extract the motion vector, z, from two reference frames using our motion encoder (Figure 3(a)). We then use the extracted motion vector z to synthesize an analogy-image given a new test image.

Our network successfully transfers the motion in reference pairs to a test image. For example, in Figure 8(a), it learns that the character leans toward the right, and in Figure 8(b) it learns that the girl spreads her feet apart. A quantitative evaluation is also shown in Figure 9.
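This zero-shot analogy procedure reuses the trained components directly: encode the Eulerian motion of the reference pair into z, then decode it against the new image. A minimal sketch, with hypothetical stand-ins for the recognition and generative networks:

```python
import numpy as np

def make_analogy(I_ref, J_ref, I_test, g_mean, synthesize_v):
    """Transfer the motion from the pair (I_ref -> J_ref) onto I_test.

    g_mean:       recognition network, (v, I) -> latent motion vector z
    synthesize_v: generative network,  (I, z) -> difference image v
    """
    v_ref = J_ref - I_ref            # Eulerian motion of the reference pair
    z = g_mean(v_ref, I_ref)         # extracted motion representation
    return I_test + synthesize_v(I_test, z)
```

No analogy labels are needed at any point: the same encoder and decoder trained on consecutive frame pairs are simply applied to a reference pair and a new image.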
Even without supervision, our method outperforms the algorithm by Reed et al. [2015], which requires visual analogy labels during training.

§Liberated pixel cup: http://lpc.opengameart.org

Figure 8: Visual analogy-making (predicted frames are marked in red).

Figure 9: Mean squared pixel error on test analogies, by animation. The first three models (Add, Dis, and Dis+Cls) are from Reed et al. [2015]:

Model   | spellcast | thrust | walk | slash | shoot | average
Add     | 41.0      | 53.8   | 55.7 | 52.1  | 77.6  | 56.0
Dis     | 40.8      | 55.8   | 52.6 | 53.5  | 79.8  | 56.5
Dis+Cls | 13.3      | 24.6   | 17.2 | 18.9  | 40.8  | 23.0
Ours    | 9.5       | 11.5   | 11.1 | 28.2  | 19.0  | 15.9

Figure 10: Learned feature maps on the shapes dataset (left), the sprites dataset (top right), and the exercise dataset (bottom right).

Visualizing feature maps  We visualize the learned feature maps (see Figure 3(b)) in Figure 10. Even without supervision, our network learns to detect objects or contours in the image. For example, we see that the network automatically learns object detectors and edge detectors on the shapes dataset. It also learns a hair detector and a body detector on the sprites and exercise datasets, respectively.

Visualizing latent representations  By visualizing the latent representations of z we have found that each dimension corresponds to a certain type of motion. For instance, in the exercise dataset, varying one dimension of z causes the girl to stand up and another causes her to move a leg. Please refer to our project page for this visualization.

Dimension of latent representation  Although our latent motion representation, z, has 3,200 dimensions, its intrinsic dimensionality is much smaller. First, zmean is very sparse. The numbers of non-zero elements of zmean for each dataset are 299 in shapes, 54 in sprites, and 978 in exercise.
Second, the number of independent components of z is even smaller. We run principal component analysis (PCA) on the zmean vectors obtained from a set of training images, and find that for each dataset a small fraction of the components covers at least 95% of the variance in zmean (5 in shapes, 2 in sprites, and 27 in exercise). This indicates that our network has learned a compact representation of motion in an unsupervised fashion, and encodes high-level knowledge using a small number of bits, rather than simply memorizing training samples. The KL-divergence criterion in Eq. 1 forces the latent representation, z, to carry minimal information, as discussed by Hinton and Van Camp [1993] and concurrently by Higgins et al. [2016].

6 Conclusion

In this paper, we have proposed a novel framework that samples future frames from a single input image. Our method incorporates a variational autoencoder for learning compact motion representations, and a novel cross convolutional layer for regressing Eulerian motion maps. We have demonstrated that our framework works well on both synthetic and real-life videos.

More generally, our results suggest that our probabilistic visual dynamics model may be useful for additional applications, such as inferring objects' higher-order relationships by examining correlations in their motion distributions. Furthermore, this learned representation could potentially be used as a sophisticated motion prior in other computer vision and computational photography applications.

Acknowledgement   The authors thank Yining Wang for helpful discussions. This work is supported by NSF Robust Intelligence 1212849, NSF Big Data 1447476, ONR MURI 6923196, Adobe, and Shell Research.
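The latent-dimensionality analysis above (counting non-zero elements of zmean and finding how many principal components cover 95% of the variance) can be sketched as follows. This is an illustrative reconstruction with simulated data, not the paper's code: the array shapes, sample count, and the assumption of 30 active dimensions are our own placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate sparse z_mean vectors: one 3200-d row per training image,
# with only a small subset of dimensions ever active (placeholder data).
n_samples, n_dims, n_active = 500, 3200, 30
z_mean = np.zeros((n_samples, n_dims))
active = rng.choice(n_dims, size=n_active, replace=False)
z_mean[:, active] = rng.normal(size=(n_samples, n_active))

# Sparsity: dimensions that are non-zero for at least one sample.
nonzero_dims = np.count_nonzero(np.any(z_mean != 0, axis=0))

# PCA via SVD of the centered data matrix; squared singular values are
# proportional to the variance explained by each principal component.
centered = z_mean - z_mean.mean(axis=0, keepdims=True)
singular_values = np.linalg.svd(centered, compute_uv=False)
variance = singular_values ** 2
explained = np.cumsum(variance) / variance.sum()
n_components_95 = int(np.searchsorted(explained, 0.95) + 1)

print(nonzero_dims)     # 30: only the simulated active dimensions are used
print(n_components_95)  # at most 30, since the data matrix has rank <= 30
```

On the paper's real zmean vectors, the analogous computation would report the sparsity counts (299/54/978) and component counts (5/2/27) quoted above; a low n_components_95 relative to the 3,200 nominal dimensions is what indicates a compact learned representation.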
The authors would also like to thank Nvidia for GPU donations.

References

Aseem Agarwala, Ke Colin Zheng, Chris Pal, Maneesh Agrawala, Michael Cohen, Brian Curless, David Salesin, and Richard Szeliski. Panoramic video textures. ACM TOG, 24(3):821–827, 2005.

Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.

Emily L Denton, Soumith Chintala, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.

Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.

David J Fleet, Michael J Black, Yaser Yacoob, and Allan D Jepson. Design and use of linear models for image motion analysis. IJCV, 36(3):171–193, 2000.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.

Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander Lerchner. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579, 2016.

Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In COLT, 1993.

Neel Joshi, Sisil Mehta, Steven Drucker, Eric Stollnitz, Hugues Hoppe, Matt Uyttendaele, and Michael Cohen. Cliplets: juxtaposing still and dynamic imagery. In UIST, 2012.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.

Zicheng Liao, Neel Joshi, and Hugues Hoppe. Automated video looping with progressive dynamism. ACM TOG, 32(4):77, 2013.

Ce Liu. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009.

Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT flow: Dense correspondence across scenes and its applications. IEEE TPAMI, 33(5):978–994, 2011.

Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.

Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Dejavu: Motion prediction in static images. In ECCV, 2014.

Javier Portilla and Eero P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. IJCV, 40(1):49–70, 2000.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In NIPS, 2015.

Stefan Roth and Michael J Black. On the spatial statistics of optical flow. In ICCV, 2005.

Arno Schödl, Richard Szeliski, David H Salesin, and Irfan Essa. Video textures. ACM TOG, 7(5):489–498, 2000.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, 2016a.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016b.

Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: unsupervised visual prediction. In CVPR, 2014.

Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In ICCV, 2015.

Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.

John YA Wang and Edward H Adelson. Layered representation for motion analysis. In CVPR, 1993.

Yair Weiss and Edward H Adelson. Slow and Smooth: a Bayesian theory for the combination of local motion signals in human vision. Center for Biological and Computational Learning Paper, 158(1624):1–42, 1998.

Yonatan Wexler, Eli Shechtman, and Michal Irani. Space-time video completion. In CVPR, 2004.

Jiajun Wu, Ilker Yildirim, Joseph J Lim, William T Freeman, and Joshua B Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, 2015.

Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Synthesizing dynamic textures and sounds by spatial-temporal generative convnet. arXiv preprint arXiv:1606.00972, 2016a.

Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. arXiv preprint arXiv:1604.03650, 2016b.

Tianfan Xue, Michael Rubinstein, Neal Wadhwa, Anat Levin, Fredo Durand, and William T Freeman. Refraction wiggles for measuring fluid depth and velocity from video. In ECCV, 2014.

Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2Image: Conditional image generation from visual attributes. In ECCV, 2016.

Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In ECCV, 2016.