{"title": "Unsupervised learning of object structure and dynamics from videos", "book": "Advances in Neural Information Processing Systems", "page_first": 92, "page_last": 102, "abstract": "Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.", "full_text": "Unsupervised Learning of Object Structure and\n\nDynamics from Videos\n\nMatthias Minderer\u2217 Chen Sun Ruben Villegas\nKevin Murphy Honglak Lee\n\nForrester Cole\n\n{mjlm, chensun, rubville, fcole, kpmurphy, honglak}@google.com\n\nGoogle Research\n\nAbstract\n\nExtracting and predicting object structure and dynamics from videos without\nsupervision is a major challenge in machine learning. To address this challenge,\nwe adopt a keypoint-based image representation and learn a stochastic dynamics\nmodel of the keypoints. Future frames are reconstructed from the keypoints and\na reference frame. By modeling dynamics in the keypoint coordinate space, we\nachieve stable learning and avoid compounding of errors in pixel space. Our\nmethod improves upon unstructured representations both for pixel-level video\nprediction and for downstream tasks requiring object-level understanding of motion\ndynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset,\nthe Human3.6M dataset, and datasets based on continuous control tasks from\nthe DeepMind Control Suite. The spatially structured representation outperforms\nunstructured representations on a range of motion-related tasks such as object\ntracking, action recognition and reward prediction.\n\n1\n\nIntroduction\n\nVideos provide rich visual information to understand the dynamics of the world. However, extracting\na useful representation from videos (e.g. detection and tracking of objects) remains challenging and\ntypically requires expensive human annotations. In this work, we focus on unsupervised learning of\nobject structure and dynamics from videos.\nOne approach for unsupervised video understanding is to learn to predict future frames [17, 16, 9, 15,\n24, 30, 8, 3, 14]. Based on this body of work, we identify two main challenges: First, it is hard to\nmake pixel-level predictions because motion in videos becomes highly stochastic for horizons beyond\nabout a second. Since semantically insigni\ufb01cant deviations can lead to large error in pixel space, it is\noften dif\ufb01cult to distinguish good from bad predictions based on pixel losses. Second, even if good\npixel-level prediction is achieved, this is rarely the desired \ufb01nal task. 
The representations of a model\ntrained for pixel-level reconstruction are not guaranteed to be useful for downstream tasks such as\ntracking, motion prediction and control.\nHere, we address both of these challenges by using an explicit, interpretable keypoint-based represen-\ntation of object structure as the core of our model. Keypoints are a natural representation of dynamic\nobjects, commonly used for face and pose tracking. Training keypoint detectors, however, generally\nrequires supervision. We learn the keypoint-based representation directly from video, without any\nsupervision beyond the pixel data, in two steps: \ufb01rst encode individual frames to keypoints, then\nmodel the dynamics of those points. As a result, the representation of the dynamics model is spatially\nstructured, though the model is trained only with a pixel reconstruction loss. We show that enforcing\nspatial structure signi\ufb01cantly improves video prediction quality and performance for tasks such as\naction recognition and reward prediction.\n\n\u2217Google AI Resident\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fBy decoupling pixel generation from dynamics prediction, we avoid compounding errors in pixel\nspace because we never condition on predicted pixels. This approach has been shown to be bene\ufb01cial\nfor supervised video prediction [25]. Furthermore, modeling dynamics in keypoint coordinate\nspace allows us to sample and evaluate predictions ef\ufb01ciently. Errors in coordinate space are\nmore meaningful than in pixel space, since distance between keypoints is more closely related to\nsemantically relevant differences than pixel-space distance. We exploit this by using a best-of-many-\nsamples objective [4] during training to achieve stochastic predictions that are both highly diverse\nand of high quality, outperforming the predictions of models lacking spatial structure.\nFinally, because we build spatial structure into our model a priori, its internal representation is biased\nto contain object-level information that is useful for downstream applications. This bias leads to\nbetter results on tasks such as trajectory prediction, action recognition and reward prediction.\nOur contributions are: (1) a novel architecture and optimization techniques for unsupervised video\nprediction with a structured internal representation; (2) a model that outperforms recent work [8,\n28] and our unstructured baseline in pixel-level video prediction; (3) improved performance vs.\nunstructured models on downstream tasks requiring object-level understanding.\n\n2 Related work\n\nUnsupervised learning of keypoints. Previous work explores learning to \ufb01nd keypoints in an image\nby applying an autoencoding architecture with keypoint-coordinates as a representational bottleneck\n[12, 33]. The bottleneck forces the image to be encoded in a small number of points. We build on\nthese methods by extending them to the video setting.\nStochastic sequence prediction. Successful video prediction requires modeling uncertainty. We\nadopt the VRNN [6] architecture, which adds latent random variables to the standard RNN archi-\ntecture, to sample from possible futures. More sophisticated approaches to stochastic prediction of\nkeypoints have been recently explored [31, 21], but we \ufb01nd the basic VRNN architecture suf\ufb01cient\nfor our applications.\nUnsupervised video prediction. 
A large body of work explores learning to predict video frames\nusing only a pixel-reconstruction loss [18, 20, 17, 9, 24, 7]. Most similar to our work are approaches\nthat perform deterministic image generation from a latent sample produced by stochastic sampling\nfrom a prior conditioned on previous timesteps [8, 3, 14]. Our approach replaces the unstructured\nimage representation with a structured set of keypoints, improving performance on video prediction\nand downstream tasks compared with SVG [8] (Section 5).\nRecent methods also apply adversarial training to improve prediction quality and diversity of sam-\nples [22, 14]. EPVA [28] predicts dynamics in a high-level feature space and applies an adversarial\nloss to the predicted features. We compare against EPVA and show improvement without adversarial\ntraining, but adversarial training is compatible with our method and is a promising future direction.\nVideo prediction with spatially structured representations. Like our approach, several recent\nmethods explore explicit, spatially structured representations for video prediction. Xu et al. [29]\nproposed to discover object parts and structure by watching how they move in videos. Vid2Vid [27]\nproposed a video-to-video translation network from segmentation masks, edge masks and human\npose. The method is also used for predicting a few frames into the future by predicting the structure\nrepresentations \ufb01rst. Villegas et al. [25] proposed to train a human pose predictor and then use the\npredicted pose to generate future frames of human motion. In [26], a method is proposed where\nfuture human pose is predicted using a stochastic network and the pose is then used to generate future\nframes. Recent methods on video generation have used spatially structured representations for video\nmotion transfer between humans [1, 5]. In contrast, our model is able to \ufb01nd spatially structured\nrepresentation without supervision while using video frames as the only learning signal.\n\n3 Architecture\n\nOur model is composed of two parts: a keypoint detector that encodes each frame into a low-\ndimensional, keypoint-based representation, and a dynamics model that predicts dynamics in the\nkeypoint space (Figure 1).\n\n2\n\n\fFigure 1: Architecture of our model. Variables are black, functions blue, losses red. Some arrows are\nomitted for clarity, see Equations 1 to 4 for details.\n\n3.1 Unsupervised keypoint detector\n\nThe keypoint detection architecture is inspired by [12], which we adapt for the video setting. Let\nv1:T \u2208 RH\u00d7W\u00d7C be a video sequence of length T . Our goal is to learn a keypoint detector\n\u03d5det(vt) = xt that captures the spatial structure of the objects in each frame in a set of keypoints xt.\nThe detector \u03d5det is a convolutional neural network that produces K feature maps, one for each\nkeypoint. Each feature map is normalized and condensed into a single (x, y)-coordinate by computing\nthe spatial expectation of the map. The number of heatmaps K is a hyperparameter that represents\nthe maximum expected number of keypoints necessary to model the data.\nFor image reconstruction, we learn a generator \u03d5rec that reconstructs frame vt from its keypoint\nrepresentation. The generator also receives the \ufb01rst frame of the sequence v1 to capture the static\nappearance of the scene: vt = \u03d5rec(v1, xt). 
Together, the keypoint detector \u03d5det and generator \u03d5rec\nform an autoencoder architecture with a representational bottleneck that forces the structure of each\nframe to be encoded in a keypoint representation [12].\nThe generator is also a convolutional neural network. To supply the keypoints to the network, each\npoint is converted into a heatmap with a Gaussian-shaped blob at the keypoint location. The K\nheatmaps are concatenated with feature maps from the \ufb01rst frame v1. We also concatenate the\nkeypoint-heatmaps for the \ufb01rst frame v1 to the decoder input for subsequent frames vt, to help the\ndecoder to \"inpaint\" background regions that were occluded in the \ufb01rst frame. The resulting tensor\nforms the input to the generator. We add skip connections from the \ufb01rst frame of the sequence to the\ngenerator output such that the actual task of the generator is to predict vt \u2212 v1.\nWe use the mean intensity \u00b5k of each keypoint feature map returned by the detector as a continuous-\nvalued indicator of the presence of the modeled object. When converting keypoints back into\nheatmaps, each map is scaled by the corresponding \u00b5k. The model can use \u00b5k to encode the presence\nor absence of individual objects on a frame-by-frame basis.\n\n3.2 Stochastic dynamics model\n\nTo model the dynamics in the video, we use a variational recurrent neural network (VRNN) [6]. The\ncore of the dynamics model is a latent belief z over keypoint locations x. In the VRNN architecture,\nthe prior belief is conditioned on all previous timesteps through the hidden state ht\u22121 of an RNN,\nand thus represents a prediction of the current keypoint locations before observing the image:\n\n(1)\nWe obtain the posterior belief by combining the previous hidden state with the unsupervised keypoint\ncoordinates xt = \u03d5det(vt) detected in the current frame:\n\np(zt|x<t, z<t) = \u03d5prior(ht\u22121)\n\nq(zt|x\u2264t, z<t) = \u03d5enc(ht\u22121, xt)\n\nPredictions are made by decoding the latent belief:\n\np(xt|z\u2264t, x<t) = \u03d5dec(zt, ht\u22121)\n\n3\n\n(2)\n\n(3)\n\nStop-gradient\fFinally, the RNN is updated to pass information forward in time:\n\nht = \u03d5RNN(xt, zt, ht\u22121).\n\n(4)\nNote that to compute the posterior (Eq. 2), we obtain xt from the keypoint detector, but for the\nrecurrence in Eq. 4, we obtain xt by decoding the latent belief. We can therefore predict into\nthe future without observing images by decoding xt from the prior belief. Because the model\nhas both deterministic and stochastic pathways across time, predictions can account for long-term\ndependencies as well as future uncertainty [10, 6].\n\n4 Training\n\n4.1 Keypoint detector\n\nThe keypoint detector is trained with a simple L2 image reconstruction loss Limage =(cid:80)\n\nt ||v \u2212 \u02c6v||2\n2,\nwhere v is the true and \u02c6v is the reconstructed image. Errors from the dynamics model are not\nbackpropagated into the keypoint detector.2\nIdeally, the representation should use as few keypoints as possible to encode each object. To encourage\nsuch parsimony, we add two additional losses to the keypoint detector:\nTemporal separation loss. Image features whose motion is highly correlated are likely to belong to\nthe same object and should ideally be represented jointly by a single keypoint. We therefore add a\nseparation loss that encourages keypoint trajectories to be decorrelated in time. 
The loss penalizes\n\"overlap\" between trajectories within a Gaussian radius \u03c3sep:\nexp(\u2212 dkk(cid:48)\n2\u03c32\nsep\n\n(cid:88)\n\n(cid:88)\n\nLsep =\n\n(5)\n\n)\n\n(cid:80)\nt ||(xt,k \u2212 (cid:104)xk(cid:105)) \u2212 (xt,k(cid:48) \u2212 (cid:104)xk(cid:48)(cid:105))||2\n\nk(cid:48)\n\nk\n\n2 is the distance between the trajectories\nwhere dkk(cid:48) = 1\nof keypoints k and k(cid:48), computed after subtracting the temporal mean (cid:104)x(cid:105) from each trajectory.\nT\n|| \u00b7 ||2\nk |\u00b5k| on the keypoint\n\nKeypoint sparsity loss. For similar reasons, we add an L1 penalty Lsparse =(cid:80)\n\n2 denotes the squared Euclidean norm.\n\nscales \u00b5 to encourage keypoints to be sparsely active.\nIn Section 5.3, we show that both Lsep and Lsparse contribute to stable keypoint detection.\n\n4.2 Dynamics model\n\nThe standard VRNN [6] is trained to encode the detected keypoints by maximizing the evidence lower\nbound (ELBO), which is composed of a reconstruction loss and a KL term between the Gaussian\nprior N prior\n\n= N (zt|\u03d5prior(ht\u22121)) and posterior distribution N enc\n\nt = N (zt|\u03d5enc(ht\u22121, xt)):\n\nt\n\n(cid:105)\n\nLVRNN = \u2212 T(cid:88)\n\nE(cid:104)\n\nlog p(xt|z\u2264t, x<t) \u2212 \u03b2KL(N enc\n\nt\n\n(cid:107) N prior\n\nt\n\n)\n\n(6)\n\nt=1\n\nThe KL term regularizes the latent representation. In the VRNN architecture, it is also responsible for\ntraining the RNN, since it encourages the prior to predict the posterior based on past information. To\nbalance reliance on predictions with \ufb01delity to observations, we add the hyperparameter \u03b2 (see also\n[2]). We found it essential to tune \u03b2 for each dataset to achieve a balance between reconstruction\nquality (lower \u03b2) and prediction diversity.\nThe KL term only trains the dynamics model for single-step predictions because the model receives\nobservations after each step [10]. To encourage learning of long-term dependencies, we add a pure\nreconstruction loss, without the KL term, for multiple future timesteps:\nE [log p(xt|z\u2264t, x\u2264T )]\n\nLfuture = \u2212 T +\u2206T(cid:88)\n\n(7)\n\n2We found this to be necessary to maintain a keypoint-structured representation. If the image model is\ntrained based on errors from the dynamics model, the image model may adopt the poorly structured code of an\nincompletely trained dynamics model, rather than the dynamics model adopting the keypoint-structured code.\n\nt=T +1\n\n4\n\n\f(a) Basketball\n\n(b) Human3.6M\n\nFigure 2: Main datasets used in our experiments. First row: Ground truth images. Second row:\nDecoded coordinates (black dots; \u02c6xt in Figure 1) and past trajectories (gray lines). Third row:\nReconstructed image. Green borders indicate observed frames, red indicate predicted frames.\nThe standard approach to estimate log p(xt|z\u2264t, x\u2264t) in Eq. 6 and 7 is to sample a single zt. To\nfurther encourage diverse predictions, we instead use the best of a number of samples [4] at each\ntimestep during training:\n\n(cid:0) log p(xt|zi,t, z<t, x<t)(cid:1),\n\nmax\n\ni\n\n(8)\n\nt\n\nt\n\nfor observed steps and zi,t \u223c N prior\n\nwhere zi,t \u223c N enc\nfor predicted steps. By giving the model\nseveral chances to make a good prediction, it is encouraged to cover a range of likely data modes,\nrather than just the most likely. Sampling and evaluating several predictions at each timestep would\nbe expensive in pixel space. 
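The per-timestep selection in Eq. 8 can be sketched directly in keypoint coordinate space. The sketch below is illustrative only: sample_z and decode are hypothetical stand-ins for drawing one latent sample (from the prior or posterior, depending on whether the step is predicted or observed) and for the keypoint decoder, and we assume a fixed-variance Gaussian likelihood, so that the best sample is simply the one with the smallest squared keypoint error.

import numpy as np

def best_of_many_step(sample_z, decode, x_obs, num_samples=8):
    # sample_z() -> one latent sample z_i; decode(z) -> predicted keypoints [K, 2].
    # With a fixed-variance Gaussian likelihood, maximizing log p(x_t | z_i) is
    # equivalent to minimizing squared error, so only the best sample is kept.
    errors = [np.sum((decode(sample_z()) - x_obs) ** 2) for _ in range(num_samples)]
    best = int(np.argmin(errors))
    return best, errors[best]  # only the best sample's loss is used for training
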
However, since we learn the dynamics in the low-dimensional keypoint\nspace, we can evaluate sampled predictions without reconstructing pixels. Due to the keypoint\nstructure, the L2 distance of samples from the observed keypoints meaningfully captures sample\nquality. This would not be guaranteed for an unstructured latent representation. As shown in Section 5,\nthe best-of-many objective is crucial to the performance of our model.\nThe combined loss of the whole model is:\n\nLimage + \u03bbsepLsep + \u03bbsparseLsparse + LVRNN + Lfuture,\n\n(9)\nwhere \u03bbsep and \u03bbsparse are scale parameters for the keypoint separation and sparsity losses. See Sec-\ntion S1 for implementation details, including a list of hyperparameters and tuning ranges (Table S1).\n\n5 Results\n\nWe \ufb01rst show that the structured representation of our model improves prediction quality on two\nvideo datasets, and then show that it is more useful than unstructured representations for downstream\ntasks that require object-level information.\n\n5.1 Structured representation improves video prediction\n\nWe evaluate frame prediction on two video datasets (Figure 2). The Basketball dataset consists of a\nsynthetic top-down view of a basketball court containing \ufb01ve offensive players and the ball, all drawn\nas colored dots. The videos are generated from real basketball player trajectories [32], testing the\nability of our model to detect and stably represent multiple objects with complex dynamics. The\ndataset contains 107,146 training and 13,845 test sequences. The Human3.6 dataset [11] contains\nvideo sequences of human actors performing various actions. We use subjects S1, S5, S6, S7, and\nS9 for training (600 videos), and subjects S9 and S11 for evaluation (239 videos). For both datasets,\nground truth object coordinates are available for evaluation, but are not used by the model. The\nBasketball dataset contains the coordinates of each of the 5 players and the ball. The Human dataset\ncontains 32 motion capture points, of which we select 12 for evaluation.\nWe compare the full model (Struct-VRNN) to a series of baselines and ablations: the Struct-VRNN\n(no BoM) model was trained without the best-of-many objective; the Struct-RNN is deterministic;\nthe CNN-VRNN architecture uses the same stochastic dynamics model as the Struct-VRNN, but uses\nan unstructured deep feature vector as its internal representation instead of structured keypoints. All\nstructured models use K = 12 for Basketball, and K = 48 for Human3.6M, and were conditioned on\n\n5\n\nTruet=0t=5t=10t=15t=20t=24Coord.Recon.Truet=0t=5t=10t=20t=30t=40Coord.Recon.\fFigure 3: Video generation quality on Human3.6M. Our stochastic structured model (Struct-VRNN)\noutperforms our deterministic baseline (Struct-RNN), our unstructured baseline (CNN-VRNN),\nand the SVG [8] and SAVP [14] models. Top: Example observed (green borders) and predicted\n(red borders) frames (best viewed as video: https://mjlm.github.io/video_structure/).\nExample sequences are the closest or furthest samples from ground truth according to VGG cosine\nsimilarity, as indicated. Note that for Struct-VRNN, even the samples furthest from ground truth are\nof high visual quality. Bottom left: Mean VGG cosine similarity of the the samples closest to ground\ntruth (left) and furthest from ground truth (right). Higher is better. Plots show mean performance\nacross 5 model initializations, with the 95% con\ufb01dence interval shaded. 
Bottom right: Fr\u00e9chet\nVideo Distance [23], using all samples. Lower is better. Dots represents separate model initializations.\nEPVA [28] is not stochastic, so we compare performance with a single sample from our method on\ntheir test set.\n\n8 frames and trained to predict 8 future frames. For the CNN-VRNN, which lacks keypoint structure,\nwe use a latent representation with 3K elements, such that its capacity is at least as large as that of\nthe Struct-VRNN representation. Finally, we compare to three published models: SVG [8], SAVP\n[14] and EPVA [28] (Figure 3).\nThe Struct-VRNN model matches or outperforms the other models in perceptual image and video\nquality as measured by VGG [19] feature cosine similarity and Fr\u00e9chet Video Distance [23] (Figure 3).\nResults for the lower-level metrics SSIM and PSNR are similar (see supplemental material).\nThe ablations suggest that the structured representation, the stochastic belief, and the best-of-many\nobjective all contribute to model performance. The full Struct-VRNN model generates the best\nreconstructions of ground truth, and also generates the most diverse samples (i.e., samples that are\nfurthest from ground truth; Figure 3 bottom left). In contrast, the ablated models and SVG show both\nlower best-case accuracy and smaller differences between closest and furthest samples, indicating\nless diverse samples. SAVP is closer, performing well on the single-frame metric (VGG cosine\nsim.), but still worse on FVD than the structured model. Qualitatively, Struct-VRNN exhibits sharper\nimages and longer object permanence than the unstructured models (Figure 3, top; note limb detail\nand dynamics). This is true even for the samples that are far from ground truth (Figure 3 top, row\n\"Struct-VRNN (furthest)\"), which suggests that our model produces diverse high-quality samples,\nrather than just a few good samples among many diverse but unrealistic ones. This conclusion is\n\n6\n\nTruet=0t=3t=5t=10t=15t=20t=25t=30t=35t=40t=45t=50Struct-VRNN(closest)Struct-VRNN(furthest)CNN-VRNN(closest)SVG(closest)SAVP(closest)1020304050Timestep0.80.91.0VGG cosine sim.Closest sampleStruct-VRNNSVRNN (no BoM)Struct-RNNCNN-VRNNSVGSAVP1020304050TimestepFurthest sampleStruct-VRNNSVRNN (no BoM)Struct-RNNCNN-VRNNSVGSAVP20040060080010001200FVDOur test setStruct-VRNNEPVAEPVA-GANEPVA test set\fFigure 4: Prediction error for ground-truth trajectories by linear regression from predicted keypoints.\n(sup.) indicates supervised baseline.\n\nFigure 5: Ablating either the temporal separation loss or the keypoint sparsity loss reduces model\nperformance and stability. In the FVD plots, each dot corresponds to a different model initialization.\nCoordinate error plots show the prediction error when regressing the ground-truth object coordinates\non the discovered keypoints. Lines show the mean of \ufb01ve model initializations, with the 95%\ncon\ufb01dence intervals shaded.\n\nbacked up by the FVD (Figure 3 bottom right), which measures the overall quality of a distribution\nof videos [23].\n\n5.2 The learned keypoints track objects\n\nWe now examine how well the learned keypoints track the location of objects. Since we do not expect\nthe keypoints to align exactly with human-labeled objects, we \ufb01t a linear regression from the keypoints\nto the ground truth object positions and measure trajectory prediction error on held-out sequences\n(Figure 4). The trajectory error is the average distance between true and predicted coordinates at each\ntimestep. 
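A minimal sketch of this evaluation is given below (illustrative only; details such as fitting the regression on the same trajectories rather than on a separate set of sequences are simplifications, not our exact protocol).

import numpy as np

def trajectory_error(keypoints, gt_coords):
    # keypoints: [T, K, 2] learned keypoints; gt_coords: [T, M, 2] labeled objects.
    T = keypoints.shape[0]
    X = np.concatenate([keypoints.reshape(T, -1), np.ones((T, 1))], axis=1)  # add bias term
    Y = gt_coords.reshape(T, -1)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # linear map from keypoints to object coordinates
    pred = (X @ W).reshape(T, -1, 2)
    return np.linalg.norm(pred - gt_coords, axis=-1).mean()  # mean distance per point and timestep
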
To account for stochasticity, we sample 50 predictions and report the error of the best.3\nAs a baseline, we train Struct-VRNN and CNN-VRNN models with additional supervision that forces\nthe learned keypoints to match the ground-truth keypoints. The keypoints learned by the unsupervised\nStruct-VRNN model are nearly as predictive as those trained with supervision, indicating that the\nlearned keypoints represent useful spatial information. In contrast, prediction from the internal\nrepresentation of the unsupervised CNN-VRNN is poor. When trained with supervision, however, the\nCNN-VRNN reaches similar performance as the supervised Struct-VRNN. In other words, both the\nStruct-VRNN and the CNN-VRNN can learn a spatial internal representation, but the Struct-VRNN\nlearns it without supervision.\nAs expected, the less diverse predictions of the Struct-VRNN (no BoM) and Struct-RNN perform\nworse on the coordinate regression task. Finally, for comparison, we remove the dynamics model\nentirely and simply predict the last observed keypoint locations for all future timepoints. All models\nexcept unsupervised CNN-VRNN outperform this baseline.\n\n5.3 Simple inductive biases improve object tracking\n\nIn Section 4.1, we described losses intended to add inductive biases such as keypoint sparsity and\nuncorrelated object trajectories to the keypoint detector. We \ufb01nd that these losses improve object\ntracking performance and stability. Figure 5 shows that models without Lsep and Lsparse show reduced\nvideo prediction and tracking performance. The increased variability between model initializations\nwithout Lsep and Lsparse suggests that these losses improve the learnability of a stable keypoint\n3For Human3.6M, we choose the best sample based on the average error of all coordinates. For Basketball,\n\nwe choose the best sample separately for each player.\n\n7\n\nNo dynamicsSVRNN (no BoM)Struct-VRNN (sup.)\fFigure 6: Unsupervised keypoints allow human-guided exploration of object dynamics. We manipu-\nlated the observed coordinates for Player 1 (black arrow) to change the original (blue) trajectory. The\nother players were not manipulated. The dynamics were then rolled out into the future to predict how\nthe players will behave in the manipulated (red) scenario. Black crosses mark initial player positions.\nLight-colored parts of the trajectories are observed, dark-colored parts are predicted. Dots indicate\n\ufb01nal position. Lines of the same color indicate different samples conditioned on the same observed\ncoordinates.\n\nstructure (also see Figure S6). In summary, we \ufb01nd that training and \ufb01nal performance is most stable\nif K is chosen to be larger than the expected number of objects, such that the model can use \u00b5 in\ncombination with Lsparse and Lsep to activate the optimal number of keypoints.\n\n5.4 Manipulation of keypoints allows interaction with the model\n\nSince the learned keypoints track objects, the\nmodel\u2019s predictions can be intuitively manipu-\nlated by directly adjusting the keypoints.\nOn the Basketball dataset, we can explore coun-\nterfactual scenarios such as predicting how the\nother players react if one player moves left as\nopposed to right (Figure 6). 
We simply manip-\nulate the sequence observed keypoint locations\nbefore they are passed to the RNN, thus condi-\ntioning the RNN states and predictions on the\nmanipulated observations.\nFor the Human3.6M dataset, we can indepen-\ndently manipulate body parts and generate poses\nthat are not present in the training set (Fig-\nure 7; please see https://mjlm.github.io/\nvideo_structure/for videos). The model\nlearns to associate keypoints with local areas of the body, such that moving keypoints near an\narm moves the arm without changing the rest of the image.\n\nFigure 7: Keypoints learned by our method may be\nmanipulated to change peoples\u2019 poses. Note that\nboth manipulations and effects are spatially local.\nBest viewed in video (https://mjlm.github.\nio/video_structure/).\n\n5.5 Structured representation retains more semantic information\n\nThe learned keypoints are also useful for downstream\ntasks such as action recognition and reward prediction in\nreinforcement learning.\nTo test action recognition performance, we train a sim-\nple 3-layer RNN to classify Human3.6M actions from a\nsequence of keypoints (see Section S2.2 for model details).\nThe keypoints learned by the structured models perform\nbetter than the unstructured features learned by the CNN-\nVRNN (Figure 8). Future prediction is not needed, so the\nRNN and VRNN models perform similarly.\nOne major application we anticipate for our model is plan-\nning and reinforcement learning of spatially de\ufb01ned tasks.\nAs a \ufb01rst step, we trained our model on a dataset collected\n\n8\n\nFigure 8: Action recognition on the\nHuman3.6M dataset. Solid line: null\nmodel (predict the most frequent action).\nDashed line: prediction from ground-\ntruth coordinates. Sup., supervised.\n\nManipulationManipulationOriginalLeft legRight legLeft armRight armS9 exampleS11 exampleStruct-VRNN (sup.)Struct-VRNNStruct-RNNCNN-VRNN (sup.)CNN-VRNN0.000.250.50Action recognitionaccuracy\fFigure 9: Predicting rewards on the DeepMind Control Suite continuous control domains. We chose\ndomains with dense rewards to ensure the random policy would provide a suf\ufb01cient reward signal\nfor this analysis. To make scales comparable across domains, errors are normalized to a null model\nwhich predicts the mean training-set-reward at all timesteps. Lines show the mean across test-set\nexamples and 5 random model initializations, with the 95% con\ufb01dence interval shaded.\n\nfrom six tasks in the DeepMind Control Suite (DMCS),\na set of simulated continuous control environments (Figure 9). Image observations and rewards\nwere collected from the DMCS environments using random actions, and we modi\ufb01ed our model\nto condition predictions on the agent\u2019s actions by feeding the actions as an additional input to the\nRNN. Models were trained without access to the task reward function. We used the latent state of the\ndynamics model as an input to a separate reward prediction model for each task (see Section S2.3 for\ndetails). The dynamics learned by the Struct-VRNN give better reward prediction performance than\nthe unstructured CNN-VRNN baseline, suggesting our architecture may be a useful addition to plan-\nning and reinforcement learning models. Concurrent work that applies a similar keypoint-structured\nmodel to control tasks con\ufb01rms these results [13].\n\n6 Discussion\n\nA major question in machine learning is to what degree prior knowledge should be built into a\nmodel, as opposed to learning it from the data. 
This question is especially important for unsupervised\nvision models trained on raw pixels, which are typically far removed from the information that is\nof interest for downstream tasks. We propose a model with a spatial inductive bias, resulting in a\nstructured, keypoint-based internal representation. We show that this structure leads to superior results\non downstream tasks compared to a representation derived from a CNN without a keypoint-based\nrepresentational bottleneck.\nThe proposed spatial prior using keypoints represents a middle ground between unstructured repre-\nsentations and an explicitly object-centric approach. For example, we do not explicitly model object\nmasks, occlusions, or depth. Our architecture either leaves these phenomena unmodeled, or learns\nthem from the data. By choosing to not build this kind of structure into the architecture, we keep our\nmodel simple and achieve stable training (see variability across initializations in Figures 3, 4, and 5)\non diverse datasets, including multiple objects and complex, articulated human shapes.\nWe also note the importance of stochasticity for the prediction of videos and object trajectories. In\nnatural videos, any sequence of conditioning frames is consistent with an astronomical number of\nplausible future frames. We found that methods that increase sample diversity (e.g. the best-of-many\nobjective [4]) led to large gains in FVD, which measures the similarity of real and predicted videos\non the level of distributions over entire videos. Conversely, due to the diversity of plausible futures,\nframe-wise measures of similarity to ground truth (e.g. VGG cosine similarity, PSNR, and SSIM) are\nnear-meaningless for measuring long-term video prediction quality.\nBeyond image-based measures, the most meaningful evaluation of a predictive model is to apply it to\ndownstream tasks of interest, such as planning and reinforcement learning for control tasks. Because\nof its simplicity, our architecture is straightforward to combine with existing architectures for tasks\nthat may bene\ufb01t from spatial structure. Applying our model to such tasks is an important future\ndirection of this work.\n\n9\n\n\fReferences\n[1] K. Aberman, R. Wu, D. Lischinski, B. Chen, and D. Cohen-Or. Learning Character-Agnostic\n\nMotion for Motion Retargeting in 2D. In SIGGRAPH, 2019.\n\n[2] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a Broken\n\nELBO. In ICML, 2018.\n\n[3] M. Babaeizadeh, C. Finn, R. Erhan, Dumitru an Campbell, and S. Levine. Stochastic variational\n\nvideo prediction. In ICLR, 2018.\n\n[4] A. Bhattacharyya, B. Schiele, and M. Fritz. Accurate and Diverse Sampling of Sequences based\n\non a \"Best of Many\" Sample Objective. In CVPR, 2018.\n\n[5] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody Dance Now. In CoRR, volume\n\nabs/1808.07371, 2018.\n\n[6] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio. A Recurrent Latent\n\nVariable Model for Sequential Data. In NeurIPS, 2015.\n\n[7] E. Denton and V. Birodkar. Unsupervised Learning of Disentangled Representations from\n\nVideo. In NeurIPS, 2017.\n\n[8] E. Denton and R. Fergus. Stochastic Video Generation with a Learned Prior. In ICML, 2018.\n\n[9] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through\n\nvideo prediction. In NIPS, 2016.\n\n[10] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. 
Learning Latent\n\nDynamics for Planning from Pixels. In ICML, 2019.\n\n[11] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and\n\npredictive methods for 3d human sensing in natural environments. In PAMI, 2014.\n\n[12] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. Conditional Image Generation for Learning the\n\nStructure of Visual Objects. In NeurIPS, 2018.\n\n[13] T. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih.\nUnsupervised learning of object keypoints for perception and control. In arXiv: 1906.11883,\n2019.\n\n[14] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic Adversarial Video\n\nPrediction. In CoRR, volume abs/1804.01523, 2018.\n\n[15] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and\n\nunsupervised learning. In ICLR, 2017.\n\n[16] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square\n\nerror. In ICLR, 2016.\n\n[17] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using\n\ndeep networks in atari games. In NeurIPS, 2015.\n\n[18] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language)\nmodeling: a baseline for generative models of natural videos. arXiv preprint:1412.6604, 2014.\n\n[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image\n\nrecognition. In CoRR, 2014.\n\n[20] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised Learning of Video Represen-\n\ntations using LSTMs. In ICML, 2015.\n\n[21] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy. Stochastic Prediction of Multi-\n\nAgent Interactions From Partial Observations. In ICLR, 2019.\n\n10\n\n\f[22] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for\n\nvideo generation. In CVPR, 2018.\n\n[23] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards\n\nAccurate Generative Models of Video: A New Metric & Challenges. In CoRR, 2018.\n\n[24] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing Motion and Content for\n\nNatural Video Sequence Prediction. In ICLR, 2017.\n\n[25] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to Generate Long-term\n\nFuture via Hierarchical Prediction. In ICML, 2017.\n\n[26] J. Walker, K. Marino, A. Gupta, and M. Hebert. The Pose Knows: Video Forecasting by\n\nGenerating Pose Futures. In NeurIPS, 2018.\n\n[27] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-Video\n\nSynthesis. In NeurIPS, 2018.\n\n[28] N. Wichers, R. Villegas, D. Erhan, and H. Lee. Hierarchical Long-term Video Prediction\n\nwithout Supervision. In ICML, 2018.\n\n[29] Z. Xu, Z. Liu, C. Sun, K. Murphy, W. T. Freeman, J. B. Tenenbaum, and J. Wu. Unsupervised\n\nDiscovery of Parts, Structure, and Dynamics. In ICLR, 2019.\n\n[30] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame\n\nsynthesis via cross convolutional networks. In NeurIPS, 2016.\n\n[31] X. Yan, A. Rastogi, R. Villegas, K. Sunkavalli, E. Shechtman, S. Hadap, E. Yumer, and H. Lee.\nMt-vae: Learning motion transformations to generate multimodal human dynamics. In ECCV,\n2018.\n\n[32] E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey. Generating Multi-Agent Trajectories using\n\nProgrammatic Weak Supervision. In ICLR, 2019.\n\n[33] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee. 
Unsupervised Discovery of Object Landmarks as Structural Representations. In CVPR, 2018.\n", "award": [], "sourceid": 53, "authors": [{"given_name": "Matthias", "family_name": "Minderer", "institution": "Google Research"}, {"given_name": "Chen", "family_name": "Sun", "institution": "Google Research"}, {"given_name": "Ruben", "family_name": "Villegas", "institution": "Adobe Research / U. Michigan"}, {"given_name": "Forrester", "family_name": "Cole", "institution": "Google Research"}, {"given_name": "Kevin", "family_name": "Murphy", "institution": "Google"}, {"given_name": "Honglak", "family_name": "Lee", "institution": "Google Brain"}]}