{"title": "Few-shot Video-to-Video Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 5013, "page_last": 5024, "abstract": "Video-to-video synthesis (vid2vid) aims at converting an input semantic video, such as videos of human poses or segmentation masks, to an output photorealistic video. While the state-of-the-art of vid2vid has advanced significantly, existing approaches share two major limitations. First, they are data-hungry. Numerous images of a target human subject or a scene are required for training. Second, a learned model has limited generalization capability. A pose-to-human vid2vid model can only synthesize poses of the single person in the training set. It does not generalize to other humans that are not in the training set. To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging few example images of the target at test time. Our model achieves this few-shot generalization capability via a novel network weight generation module utilizing an attention mechanism. We conduct extensive experimental validations with comparisons to strong baselines using several large-scale video datasets including human-dancing videos, talking-head videos, and street-scene videos. The experimental results verify the effectiveness of the proposed framework in addressing the two limitations of existing vid2vid approaches.", "full_text": "Few-shot Video-to-Video Synthesis\n\nTing-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, Bryan Catanzaro\n\n{tingchunw,mingyul,atao,guilinl,jkautz,bcatanzaro}@nvidia.com\n\nNVIDIA Corporation\n\nAbstract\n\nVideo-to-video synthesis (vid2vid) aims at converting an input semantic video,\nsuch as videos of human poses or segmentation masks, to an output photorealistic\nvideo. While the state-of-the-art of vid2vid has advanced signi\ufb01cantly, existing\napproaches share two major limitations. First, they are data-hungry. Numerous\nimages of a target human subject or a scene are required for training. Second, a\nlearned model has limited generalization capability. A pose-to-human vid2vid\nmodel can only synthesize poses of the single person in the training set. It does not\ngeneralize to other humans that are not in the training set. To address the limitations,\nwe propose a few-shot vid2vid framework, which learns to synthesize videos of\npreviously unseen subjects or scenes by leveraging few example images of the target\nat test time. Our model achieves this few-shot generalization capability via a novel\nnetwork weight generation module utilizing an attention mechanism. We conduct\nextensive experimental validations with comparisons to strong baselines using\nseveral large-scale video datasets including human-dancing videos, talking-head\nvideos, and street-scene videos. The experimental results verify the effectiveness\nof the proposed framework in addressing the two limitations of existing vid2vid\napproaches. Code is available at our website.\n\nIntroduction\n\n1\nVideo-to-video synthesis (vid2vid) refers to the task of converting an input semantic video to an\noutput photorealistic video. It has a wide range of applications, including generating a human-dancing\nvideo using a human pose sequence [7, 12, 57, 67], or generating a driving video using a segmentation\nmask sequence [57]. Typically, to obtain such a model, one begins with collecting a training dataset\nfor the target task. 
It could be a set of videos of a target person performing diverse actions or a set of street-scene videos captured by using a camera mounted on a car driving in a city. The dataset is then used to train a model that converts novel input semantic videos to corresponding photorealistic videos at test time. In other words, we expect that a vid2vid model for humans can generate videos of the same person performing novel actions that are not in the training set, and that a street-scene vid2vid model can generate videos of novel street scenes with the same style as those in the training set. With the advance of the generative adversarial network (GAN) framework [13] and its image-conditional extensions [22, 58], existing vid2vid approaches have shown promising results.

We argue that generalizing to novel input semantic videos is insufficient. One should also aim for a model that can generalize to unseen domains, such as generating videos of human subjects that are not included in the training dataset. More ideally, a vid2vid model should be able to synthesize videos of unseen domains by leveraging just a few example images given at test time. If a vid2vid model cannot generalize to unseen persons or scene styles, then we must train a model for each new subject or scene style. Moreover, if a vid2vid model cannot achieve this domain generalization capability with only a few example images, then one has to collect many images for each new subject or scene style, which makes the model not easily scalable. Unfortunately, existing vid2vid approaches suffer from these drawbacks as they do not consider such generalization.

Figure 1: Comparison between vid2vid (left) and the proposed few-shot vid2vid (right). Existing vid2vid methods [7, 12, 57] do not consider generalization to unseen domains. A trained model can only be used to synthesize videos similar to those in the training set. For example, a vid2vid model can only be used to generate videos of the person in the training set. To synthesize a new person, one needs to collect a dataset of the new person and use it to train a new vid2vid model. On the other hand, our few-shot vid2vid model does not have these limitations. Our model can synthesize videos of new persons by leveraging a few example images provided at test time.

To address these limitations, we propose the few-shot vid2vid framework. The few-shot vid2vid framework takes two inputs for generating a video, as shown in Figure 1. In addition to the input semantic video as in vid2vid, it takes a second input, which consists of a few example images of the target domain made available at test time. Note that this is absent in existing vid2vid approaches [7, 12, 57, 67]. Our model uses these few example images to dynamically configure the video synthesis mechanism via a novel network weight generation mechanism. Specifically, we train a module to generate the network weights using the example images. We carefully design the learning objective function to facilitate learning the network weight generation module.

We conduct extensive experimental validation with comparisons to various baseline approaches using several large-scale video datasets including dance videos, talking-head videos, and street-scene videos. The experimental results show that the proposed approach effectively addresses the limitations of existing vid2vid frameworks. 
Moreover, we show that the performance of our model is positively\ncorrelated with the diversity of the videos in the training dataset, as well as the number of example\nimages available at test time. When the model sees more different domains in the training time, it can\nbetter generalize to deal with unseen domains (Figure 7(a)). When giving the model more example\nimages at test time, the quality of synthesized videos improves (Figure 7(b)).\n2 Related Work\nGANs. The proposed few-shot vid2vid model is based on GANs [13]. Speci\ufb01cally, we use\na conditional GAN framework. Instead of generating outputs by converting samples from some\nnoise distribution [13, 42, 32, 14, 25], we generate outputs based on user input data, which allows\nmore \ufb02exible control over the outputs. The user input data can take various forms, including\nimages [22, 68, 30, 41], categorical labels [39, 35, 65, 4], textual descriptions [43, 66, 62], and\nvideos [7, 12, 57, 67]. Our model belongs to the last one. However, different from the existing video-\nconditional GANs, which take the video as the sole data input, our model also takes a set of example\nimages. These example images are provided at test time, and we use them to dynamically determine\nthe network weights of our video synthesis model through a novel network weight generation module.\nThis helps the network generate videos of unseen domains.\nImage-to-image synthesis, which transfers an input image from one domain to a corresponding\nimage in another domain [22, 50, 3, 46, 68, 30, 21, 69, 58, 8, 41, 31, 2], is the foundation of vid2vid.\nFor videos, the new challenge lies in generating sequences of frames that are not only photorealistic\nindividually but also temporally consistent as a whole. Recently, the FUNIT [31] was proposed for\ngenerating images of unseen domains via the adaptive instance normalization technique [19]. Our\nwork is different in that we aim for video synthesis and achieve generalization to unseen domains via\na network weight generation scheme. We compare these techniques in the experiment section.\nVideo generative models can be divided into three main categories, including 1) unconditional video\nsynthesis models [54, 45, 51], which convert random noise samples to video clips, 2) future video\nprediction models [48, 24, 11, 34, 33, 63, 55, 56, 10, 53, 29, 27, 18, 28, 16, 40], which generate\nfuture video frames based on the observed ones, and 3) vid2vid models [57, 7, 12, 67], which\nconvert semantic input videos to photorealistic videos. Our work belongs to the last category, but\n\n2\n\nfew-shot vid2vidExample images of person 1Example images of person NOutput videosInput videosvid2vid for person 1vid2vid for person NOutput videosInput videos\fin contrast to the prior works, we aim for a vid2vid model that can synthesize videos of unseen\ndomains by leveraging few example images given at test time.\nAdaptive networks refer to networks where part of the weights are dynamically computed based\non the input data. This class of networks has a different inductive bias to regular networks and has\nfound use in several tasks including sequence modeling [15], image \ufb01ltering [23, 59, 49], frame\ninterpolation [38, 37], and neural architecture search [64]. Here, we apply it to the vid2vid task.\nHuman pose transfer synthesizes a human in an unseen pose by utilizing an image of the human in\na different pose. 
To achieve high quality generation results, existing human pose transfer methods largely utilize human body priors such as body part modeling [1] or human surface-based coordinate mapping [36]. Our work differs from these works in that our method is more general. We do not use specific human body priors other than the input semantic video. As a result, the same model can be directly used for other vid2vid tasks such as street scene video synthesis, as shown in Figure 5. Moreover, our model is designed for video synthesis, while existing human pose transfer methods are mostly designed for still image synthesis and do not consider the temporal aspect of the problem. As a result, our method renders more temporally consistent results (Figure 4).

3 Few-shot Video-to-Video Synthesis

Video-to-video synthesis aims at learning a mapping function that can convert a sequence of input semantic images¹, $s_1^T \equiv s_1, s_2, \ldots, s_T$, to a sequence of output images, $\tilde{x}_1^T \equiv \tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_T$, in a way that the conditional distribution of $\tilde{x}_1^T$ given $s_1^T$ is similar to the conditional distribution of the ground truth image sequence, $x_1^T \equiv x_1, x_2, \ldots, x_T$, given $s_1^T$. In other words, it aims to achieve $D\big(p(\tilde{x}_1^T \mid s_1^T),\, p(x_1^T \mid s_1^T)\big) \rightarrow 0$, where $D$ is a distribution divergence measure such as the Jensen-Shannon divergence or the Wasserstein distance. To model the conditional distribution, existing works make a simplified Markov assumption, leading to a sequential generative model given by

$\tilde{x}_t = F(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t})$   (1)

In other words, it generates the output image, $\tilde{x}_t$, based on the observed $\tau + 1$ input semantic images, $s_{t-\tau}^{t}$, and the past $\tau$ generated images, $\tilde{x}_{t-\tau}^{t-1}$. The sequential generator $F$ can be modeled in several different ways [7, 12, 57, 67]. A popular choice is to use an image matting function given by

$F(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}) = (1 - \tilde{m}_t) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_t \odot \tilde{h}_t$   (2)

where the symbol $1$ is an image of all ones, $\odot$ is the element-wise product operator, $\tilde{m}_t$ is a soft occlusion map, $\tilde{w}_{t-1}$ is the optical flow from $t-1$ to $t$, and $\tilde{h}_t$ is a synthesized intermediate image. Figure 2(a) visualizes the vid2vid architecture and the matting function, which shows the output image $\tilde{x}_t$ is generated by combining the optical-flow warped version of the last generated image, $\tilde{w}_{t-1}(\tilde{x}_{t-1})$, and the synthesized intermediate image, $\tilde{h}_t$. The soft occlusion map, $\tilde{m}_t$, dictates how these two images are combined at each pixel location. Intuitively, if a pixel is observed in the previously generated frame, it would favor duplicating the pixel value from the warped image. In practice, these quantities are generated via neural network-parameterized functions $M$, $W$, and $H$:

$\tilde{m}_t = M_{\theta_M}(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}),$   (3)
$\tilde{w}_{t-1} = W_{\theta_W}(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}),$   (4)
$\tilde{h}_t = H_{\theta_H}(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}),$   (5)

where $\theta_M$, $\theta_W$, and $\theta_H$ are learnable parameters. They are kept fixed once the training is done.
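The matting function in (2) amounts to backward-warping the previously generated frame with the estimated flow and blending it with the hallucinated image under a soft occlusion mask. The following PyTorch-style code is a minimal sketch of this composition step; the function names, tensor shapes, and the grid_sample-based warp are illustrative assumptions rather than the actual implementation.

```python
# Minimal sketch of the matting function in Eq. (2):
# x_t = (1 - m_t) * warp(x_{t-1}, w_{t-1}) + m_t * h_t.
# Shapes and the backward-warping convention are illustrative assumptions.
import torch
import torch.nn.functional as F


def flow_warp(image, flow):
    """Backward-warp `image` (N, C, H, W) with a dense flow field (N, 2, H, W) in pixels."""
    n, _, h, w = image.shape
    # Base sampling grid in pixel coordinates (x first, then y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                               # shifted coordinates
    # Normalize to [-1, 1], as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)           # (N, H, W, 2)
    return F.grid_sample(image, grid_norm, align_corners=True)


def composite_frame(prev_frame, flow, occlusion_mask, intermediate_image):
    """Eq. (2): blend the warped previous frame and the hallucinated image with a soft mask."""
    warped_prev = flow_warp(prev_frame, flow)
    return (1.0 - occlusion_mask) * warped_prev + occlusion_mask * intermediate_image
```

Here prev_frame, flow, occlusion_mask, and intermediate_image stand for $\tilde{x}_{t-1}$, $\tilde{w}_{t-1}$, $\tilde{m}_t$, and $\tilde{h}_t$, respectively, and are assumed to share the same spatial resolution.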
Few-shot vid2vid. While the sequential generator in (1) is trained for converting novel input semantic videos, it is not trained for synthesizing videos of unseen domains. For example, a model trained for a particular person can only be used to generate videos of the same person. In order to adapt $F$ to unseen domains, we let $F$ depend on extra inputs. Specifically, we let $F$ take two more input arguments: one is a set of $K$ example images $\{e_1, e_2, \ldots, e_K\}$ of the target domain, and the other is the set of their corresponding semantic images $\{s_{e_1}, s_{e_2}, \ldots, s_{e_K}\}$. That is

$\tilde{x}_t = F(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}, \{e_1, e_2, \ldots, e_K\}, \{s_{e_1}, s_{e_2}, \ldots, s_{e_K}\}).$   (6)

¹For example, a segmentation mask or an image denoting a human pose.

Figure 2: (a) Architecture of the vid2vid framework [57]. (b) Architecture of the proposed few-shot vid2vid framework. It consists of a network weight generation module E that maps example images to part of the network weights for video synthesis. The module E consists of three sub-networks: E_F, E_P, and E_A (used when K > 1). The sub-network E_F extracts features q from the example images. When there are multiple example images (K > 1), E_A combines the extracted features by estimating soft attention maps α and computing a weighted average of the extracted features. The final representation is then fed into the network E_P to generate the weights θ_H for the image synthesis network H. (c) Intermediate image synthesis network (for K = 1).

This modeling allows $F$ to leverage the example images given at test time to extract useful patterns for synthesizing videos of the unseen domain. We propose a network weight generation module E for extracting these patterns. Specifically, E is designed to extract patterns from the provided example images and use them to compute the network weights $\theta_H$ for the intermediate image synthesis network $H$:

$\theta_H = E(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}, \{e_1, e_2, \ldots, e_K\}, \{s_{e_1}, s_{e_2}, \ldots, s_{e_K}\}).$   (7)

Note that the network E does not generate the weights $\theta_M$ or $\theta_W$, because the flow prediction network $W$ and the soft occlusion map prediction network $M$ are designed for warping the last generated image, and warping is a mechanism that is naturally shared across domains.

We build our few-shot vid2vid framework based on Wang et al. [57], which is the state-of-the-art for the vid2vid task. Specifically, we reuse their proposed flow prediction network $W$ and soft occlusion map prediction network $M$. The intermediate image synthesis network $H$ is a conditional image generator. Instead of reusing the architecture proposed by Wang et al. [57], we adopt the SPADE generator [41], which is the current state-of-the-art semantic image synthesis model.

The SPADE generator contains several spatial modulation branches and a main image synthesis branch. Our network weight generation module E only generates the weights for the spatial modulation branches. This has two main advantages. First, it greatly reduces the number of parameters that E has to generate, which avoids the overfitting problem. Second, it avoids creating a shortcut from the example images to the output image, since the generated weights are only used in the spatial modulation modules, which generate the modulation values for the main image synthesis branch. In the following, we discuss the design of the network E and the learning objective.
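Before the layer-wise design is detailed below, the following minimal PyTorch-style sketch shows the core idea of the weight generation step: a convolutional encoder (playing the role of E_F) summarizes an example image, a multi-layer perceptron (playing the role of E_P) maps the summary to convolution kernels, and the generated kernels are applied to the current semantic map in a modulation branch. All layer sizes, channel counts, and names here are illustrative assumptions, not the actual architecture.

```python
# Minimal sketch (illustrative assumptions) of the weight generation idea:
# an encoder summarizes the example image, an MLP maps that summary to the kernel
# of one SPADE modulation convolution, and the kernel is applied with F.conv2d.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExampleWeightGenerator(nn.Module):
    def __init__(self, in_channels=3, feat_channels=64, sem_channels=20,
                 mod_channels=128, kernel_size=3):
        super().__init__()
        # E_F: a small convolutional feature extractor for the example image.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # E_P: an MLP that turns the pooled appearance code into a flat weight vector.
        self.kernel_shape = (mod_channels, sem_channels, kernel_size, kernel_size)
        num_weights = mod_channels * sem_channels * kernel_size * kernel_size
        self.mlp = nn.Sequential(
            nn.Linear(feat_channels, 256), nn.ReLU(),
            nn.Linear(256, num_weights),
        )

    def forward(self, example_image):
        q = self.encoder(example_image)        # appearance representation (N, C, H/4, W/4)
        q = q.mean(dim=(2, 3))                 # global average pooling -> (N, C)
        theta = self.mlp(q)                    # generated weights, one vector per example
        return theta.view(-1, *self.kernel_shape)[0]   # kernel for a single example (K = 1)


def modulation_features(semantic_map, generated_kernel):
    # The generated kernel convolves the current semantic map s_t, as in a
    # spatial modulation branch; padding keeps the spatial size unchanged.
    return F.conv2d(semantic_map, generated_kernel,
                    padding=generated_kernel.shape[-1] // 2)
```

In the full model, one such set of generated kernels is produced for every SPADE layer of H, as described next.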
Network weight generation module. As discussed above, the goal of the network weight generation module E is to learn to extract appearance patterns that can be injected into the video synthesis branch by controlling its weights. We first consider the case where only one example image is available (K = 1) and then extend the discussion to handle multiple example images.

We decompose E into two sub-networks: an example feature extractor E_F and a multi-layer perceptron E_P. The network E_F consists of several convolutional layers and is applied to the example image $e_1$ to extract an appearance representation $q$. The representation $q$ is then fed into E_P to generate the weights $\theta_H$ of the intermediate image synthesis network $H$.

Let the image synthesis network $H$ have $L$ layers $H^l$, where $l \in [1, L]$. We design the weight generation network E to also have $L$ layers, with each $E^l$ generating the weights for the corresponding $H^l$. Specifically, to generate the weights $\theta_H^l$ for layer $H^l$, we first take the output $q^l$ from the $l$-th layer of E_F. We then average-pool $q^l$ (since $q^l$ may still be a feature map with spatial dimensions) and apply a multi-layer perceptron $E_P^l$ to generate the weights $\theta_H^l$. Mathematically, if we define $q^0 \equiv e_1$, then $q^l = E_F^l(q^{l-1})$ and $\theta_H^l = E_P^l(q^l)$.

These generated weights are then used to convolve the current input semantic map $s_t$ to generate the normalization parameters used in SPADE (Figure 2(c)). For each layer in the main SPADE generator, we use $\theta_H^l$ to compute the denormalization parameters $\gamma^l$ and $\beta^l$ that denormalize the input features. We note that, in the original SPADE module, the scale map $\gamma^l$ and bias map $\beta^l$ are generated by fixed weights operating on the input semantic map $s_t$. In our setting, these maps are generated by the dynamic weights $\theta_H^l$. Moreover, $\theta_H^l$ contains three sets of weights: $\theta_S^l$, $\theta_\gamma^l$, and $\theta_\beta^l$. $\theta_S^l$ acts as a shared layer to extract common features, and $\theta_\gamma^l$ and $\theta_\beta^l$ take the output of $\theta_S^l$ to generate the $\gamma^l$ and $\beta^l$ maps, respectively. For each BatchNorm layer in $H^l$, we compute the denormalized features $p_H^l$ from the normalized features $\hat{p}_H^l$ by

$p_S^l = \begin{cases} s_t, & \text{if } l = 0 \\ \sigma\big(p_S^{l-1} \circledast \theta_S^l\big), & \text{otherwise} \end{cases}$   (8)
$\gamma^l = p_S^l \circledast \theta_\gamma^l, \qquad \beta^l = p_S^l \circledast \theta_\beta^l$   (9)
$p_H^l = \gamma^l \odot \hat{p}_H^l + \beta^l$   (10)

where $\circledast$ denotes convolution and $\sigma$ is the nonlinearity function.

Attention-based aggregation (K > 1). In addition, we want E to be capable of extracting patterns from an arbitrary number of example images. As different example images may carry different appearance patterns, and they have different degrees of relevance to different input images, we design an attention mechanism [61, 52] to aggregate the extracted appearance patterns $q_1, \ldots, q_K$. To this end, we construct a new attention network E_A, which consists of several fully convolutional layers. E_A is applied to each of the semantic images of the example images, $s_{e_k}$. This results in a key vector $a_k \in \mathbb{R}^{C \times N}$, where $C$ is the number of channels and $N = H \times W$ is the spatial dimension of the feature map. We also apply E_A to the current input semantic image $s_t$ to extract its key vector $a_t \in \mathbb{R}^{C \times N}$. We then compute the attention weight $\alpha_k \in \mathbb{R}^{N \times N}$ by taking the matrix product $\alpha_k = (a_k)^T \otimes a_t$. The attention weights are then used to compute a weighted average of the appearance representations, $q = \sum_{k=1}^{K} q_k \otimes \alpha_k$, which is then fed into the multi-layer perceptron E_P to generate the network weights (Figure 2(b)). This aggregation mechanism is helpful when different example images contain different parts of the subject. For example, when example images include both the front and back of the target person, the attention maps can help capture the corresponding body parts during synthesis (Figure 7(c)).

Warping example images. To ease the burden of the image synthesis network, we can also (optionally) warp the given example image and combine it with the intermediate synthesized output $\tilde{h}_t$. Specifically, we make the model estimate an additional flow $\tilde{w}_{e_t}$ and mask $\tilde{m}_{e_t}$, which are used to warp the example image $e_1$ to the current input semantics, similar to how we warp and combine with previous frames. The new intermediate image then becomes

$\tilde{h}'_t = (1 - \tilde{m}_{e_t}) \odot \tilde{w}_{e_t}(e_1) + \tilde{m}_{e_t} \odot \tilde{h}_t$   (11)

In the case of multiple example images, we pick $e_1$ to be the image that has the largest similarity score to the current frame by looking at the attention weights $\alpha$. In practice, we found this helpful when the example and target images are similar in most regions, such as when synthesizing poses where the background remains static.
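As a concrete illustration of the attention-based aggregation above, the sketch below correlates the per-example keys with the key of the current semantic map and uses the result to average the per-example appearance features. The tensor shapes and the softmax normalization are assumptions of this sketch (the aggregation is specified above only as matrix products followed by a weighted average).

```python
# Minimal sketch (with assumed shapes and an assumed softmax normalization) of the
# attention-based aggregation used when K > 1 example images are available:
# a_k = E_A(s_{e_k}), a_t = E_A(s_t), alpha_k = a_k^T a_t, q = sum_k q_k alpha_k.
import torch
import torch.nn.functional as F


def aggregate_examples(example_feats, example_keys, target_key):
    """Aggregate per-example appearance features with spatial attention.

    example_feats: (K, C_q, N) appearance features q_k, flattened over space
    example_keys:  (K, C, N)   keys a_k from the example semantic maps
    target_key:    (C, N)      key a_t from the current input semantic map s_t
    Returns the aggregated representation q of shape (C_q, N).
    """
    k, _, n = example_keys.shape
    # alpha_k = a_k^T a_t: one (N x N) attention map per example image.
    attn = torch.einsum('kcn,cm->knm', example_keys, target_key)     # (K, N, N)
    # Normalize over all example positions so the result is a weighted average
    # (the normalization scheme is an assumption of this sketch).
    attn = F.softmax(attn.reshape(k * n, n), dim=0).reshape(k, n, n)
    # q = sum_k q_k alpha_k.
    return torch.einsum('kcn,knm->cm', example_feats, attn)
```

The aggregated representation q is then passed through E_P to generate the network weights, exactly as in the K = 1 case.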
Training. We use the same learning objective as in the vid2vid framework [57]. But instead of training the vid2vid model using data from one domain, we use data from multiple domains. In Figure 7(a), we show that the performance of our few-shot vid2vid model is positively correlated with the number of domains included in the training dataset. This shows that our model can gain from increased visual experience. Our framework is trained in the supervised setting where paired $s_1^T$ and $x_1^T$ are available. We train our model to convert $s_1^T$ to $x_1^T$ by using $K$ example images randomly sampled from $x$. We adopt a progressive training technique, which gradually increases the length of the training sequences. Initially, we set $T = 1$, which means the network only generates single frames. After that, we double the sequence length $T$ every few epochs.

Inference. At test time, our model can take an arbitrary number of example images. In Figure 7(b), we show that our performance is positively correlated with the number of example images. Moreover, we can also (optionally) finetune the network using the given example images to improve performance. Note that we only finetune the weight generation module E and the intermediate image synthesis network $H$, and leave all parameters related to flow estimation ($\theta_M$, $\theta_W$) fixed. We found this better preserves the person identity in the example image.

4 Experiments

Implementation details. Our training procedure follows the procedure from the vid2vid work [57]. We use the ADAM optimizer [26] with lr = 0.0004 and ($\beta_1$, $\beta_2$) = (0.5, 0.999). Training was conducted using an NVIDIA DGX-1 machine with 8 32GB V100 GPUs.

Datasets. We adopt three video datasets to validate our method.
• YouTube dancing videos. This dataset consists of 1,500 dancing videos from YouTube. We divide them into a training set and a test set with no overlapping subjects. Each video is further divided into short clips of continuous motions, which yields about 15,000 clips for training. At each iteration, we randomly pick a clip and select one or more frames in the same clip as the example images. At test time, both the example images and the input human poses are not seen during training.
• Street-scene videos. We use street-scene videos from three different geographical areas: 1) Germany, from the Cityscapes dataset [9], 2) Boston, collected using a dashcam, and 3) NYC, collected by a different dashcam. We apply a pretrained segmentation network [60] to get the segmentation maps. Again, during training, we randomly select one frame of the same area as the example image. At test time, in addition to the test set images from these three areas, we also test on the ApolloScape [20] and CamVid [5] datasets, which are not included in the training set.
• Face videos. We use the real videos in the FaceForensics dataset [44], which contains 854 videos of news briefings from different reporters. We split the dataset into 704 videos for training and 150 videos for validation. We extract sketches from the input videos similar to vid2vid, and select one frame of the same video as the example image to convert sketches to face videos.

Baselines. Since no existing vid2vid method can adapt to unseen domains using few example images, we construct three strong baselines that consider different ways of achieving the target generalization capability. 
For the following comparisons and \ufb01gures, all methods use 1 example image.\n\u2022 Encoder. In this baseline approach, we encode the example images into a style vector and then\n\ndecode the features using the image synthesis branch in our H to generate \u02dcht.\n\n\u2022 ConcatStyle. In this baseline approach, we also encode the example images into a style vector.\nHowever, instead of directly decoding the style vector using the image synthesis branch in our\nH, it concatenates the vector with each of the input semantic images to produce an augmented\nsemantic input image. This image is then used as input to the spatial modulation branches in our\nH for generating the intermediate image \u02dcht.\n\n\u2022 AdaIN. In this baseline, we insert an AdaIN normalization layer after each spatial modulation\nlayer in the image synthesis branch of H. We generate the AdaIN normalization parameters by\nfeeding the example images to an encoder, similar to the FUNIT method [31].\n\nIn addition to these baselines, for the human synthesis task, we also compare our approach with the\nfollowing methods using the pretrained models provided by the authors.\n\u2022 PoseWarp [1] synthesizes humans in unseen poses using an example image. The idea is to\nassume each limb undergoes a similarity transformation. The \ufb01nal output image is obtained by\ncombining all transformed limbs together.\n\u2022 MonkeyNet [47] is proposed for transferring motions from a sequence to a still image. It \ufb01rst\n\ndetects keypoints in the images, and then predicts their \ufb02ows for warping the still image.\n\nEvaluation metrics. We use the following metrics for quantitative evaluation.\n\u2022 Fr\u00e9chet Inception Distance (FID) [17] measures the distance between the distributions of real\n\u2022 Pose error. We estimate the poses of the synthesized subjects using OpenPose [6]. This renders a\nset of joint locations for each video frame. We then compute the absolute error in pixels between\n\ndata and generated data. It is commonly used to quantify the \ufb01delity of synthesized images.\n\n6\n\n\fYouTube Dancing videos\n\nPose Error\n\nHuman Pref.\n\nStreet Scene videos\n\nHuman Pref.\n\nMethod\nEncoder\nConcatStyle\nAdaIN\nPoseWarp [1]\nMonkeyNet [47]\nOurs\n\n13.30\n13.32\n12.66\n16.84\n13.73\n6.01\n\nFID\n234.71\n140.87\n207.18\n180.31\n260.77\n80.44\n\nTable 1: Our method outperforms existing pose transfer methods and our baselines for both dancing\nand street scene video synthesis tasks. For pose error and FID, lower is better. For pixel accuracy and\nmIoU, higher is better. The human preference score indicates the fraction of subjects favoring results\nsynthesized by our method.\n\n0.96\n0.95\n0.93\n0.83\n0.93\n\u2014\n\nPixel Acc mIoU\n0.222\n0.240\n0.360\nN/A\nN/A\n0.408\n\n0.400\n0.479\n0.756\nN/A\nN/A\n0.831\n\nFID\n187.10\n154.33\n205.54\nN/A\nN/A\n144.24\n\n0.97\n0.97\n0.87\nN/A\nN/A\n\u2014\n\nFigure 3: Visualization of human video synthesis results. Given the same pose video but different\nexample images, our method synthesizes realistic videos of the subjects, who are not seen during\ntraining. The \ufb01gure is best viewed with Acrobat Reader. Click the image to play the video clip.\n\nthe estimated pose and the original pose input to the model. The idea behind this metric is that\nif the image is well-synthesized, a well-trained human pose estimation network should be able\nto recover the original pose used to synthesize the image. 
We note similar ideas were used in\nevaluating image synthesis performance in several prior works [22, 58, 57].\n\n\u2022 Segmentation accuracy. To evaluate the performance of street scene videos, we run a state-of-\nthe-art street scene segmentation network on the result videos generated by all the competing\nmethods. We then report the pixel accuracy and mean intersection-over-union (IoU) ratio. The\nidea of using segmentation accuracy as a performance metric follows the discussion of using the\npose error as discussed above.\n\n\u2022 Human subjective score. Finally, we use Amazon Mechanical Turk (AMT) to evaluate the\nquality of generated videos. We perform AB tests where we provide the user videos from\n\n7\n\n\fFigure 4: Comparisons against different baselines for human motion synthesis. Note that the\ncompeting methods either have many visible artifacts or completely fail to transfer the motion. The\n\ufb01gure is best viewed with Acrobat Reader. Click the image to play the video clip.\n\nFigure 5: Visualization of street scene video synthesis results. Our approach is able to synthesize\nvideos that realistically re\ufb02ect the style in the example images even if the style is not included in the\ntraining set. The \ufb01gure is best viewed with Acrobat Reader. Click the image to play the video clip.\n\ntwo different approaches and ask them to choose the one with better quality. For each pair of\ncomparisons, we generate 100 clips, each of them viewed by 60 workers. Orders are randomized.\nMain results. In Figure 3, we show results of using different example images when synthesizing\nhumans. It can be seen that our method can successfully transfer motion to all the example images.\nFigure 4 shows comparisons of our approaches against other methods. It can be seen that other\nmethods either generate obvious artifacts or fail to transfer the motion faithfully.\nFigure 5 shows the results of synthesizing street scene videos with different example images. It can\nbe seen that even with the same input segmentation map, our method can achieve different visual\nresults using different example images.\nTable 1 shows quantitative comparisons of both tasks against the other methods. It can be seen that\nour method consistently achieves better results than the others on all the performance metrics.\nIn Figure 6, we show results of using different example images when synthesizing faces. Our method\ncan faithfully preserve the person identity while capturing the motion in the input videos.\nFinally, to verify our hypothesis that a larger training dataset helps improve the quality of synthesized\nvideos, we conduct an experiment where part of the dataset is held out during training. We vary the\nnumber of videos in the training set and plot the resulting performance in Figure 7(a). We \ufb01nd that\nthe results support our hypothesis. We also evaluate whether having access to more example images\n\n8\n\n\fFigure 6: Visualization of face video synthesis results. Given the same input video but different\nexample images, our method synthesizes realistic videos of the subjects, who are not seen during\ntraining. The \ufb01gure is best viewed with Acrobat Reader. Click the image to play the video clip.\n\nFigure 7: (a) The plot shows the quality of our synthesized videos improves when it is trained with\na larger dataset. Large variety helps learn a more generalizable network weight generation module\nand hence improves adaptation capability. 
(b) The plot shows the quality of our synthesized videos\nis correlated with the number of example images provided at test time. The proposed attention\nmechanism can take advantage of a larger example set to better generate the network weights. (c)\nVisualization of attention maps when multiple example images are given. Note that when synthesizing\nthe front of the target, the attention map indicates that the network utilizes more of the front example\nimage, and vice versa.\n\nat test time helps with the video synthesis performance. As shown in Figure 7(b), the result con\ufb01rms\nour assumption.\nLimitations. Although our network can, in principal, generalize to unseen domains, when the test\ndomain is too different from the training domains it will not perform well. For example, when\ntesting on CG characters which look very different from real-world people, the network will struggle.\nIn addition, since our network is based on semantic estimations as input such as pose maps or\nsegmentation maps, when these estimations fail our network will also likely fail.\n\n5 Conclusion\n\nWe presented a few-shot video-to-video synthesis framework that can synthesize videos of unseen\nsubjects or street scene styles at the test time. This was enabled by our novel adaptive network\nweight generation scheme, which dynamically determines the weights based on the example images.\nExperimental results showed that our method performs favorably against the competing methods.\n\n9\n\n(c) Example Attn mapsfrontbackInput framesAttention mapsfrontfrontbackbackExample images(b) Effect of number ofexamples at inference(a) Effect of trainingdatasetsize\fReferences\n\n[1] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. Synthesizing images of humans in unseen\n\nposes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.\n\n[2] S. Benaim and L. Wolf. One-shot unsupervised cross domain translation. In Advances in Neural Information\n\nProcessing Systems (NIPS), 2018.\n\n[3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain\nadaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), 2017.\n\n[4] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high \ufb01delity natural image synthesis.\n\nIn International Conference on Learning Representations (ICLR), 2019.\n\n[5] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from\n\nmotion point clouds. In European Conference on Computer Vision (ECCV), 2008.\n\n[6] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: realtime multi-person 2D pose\n\nestimation using Part Af\ufb01nity Fields. In arXiv preprint arXiv:1812.08008, 2018.\n\n[7] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. arXiv preprint arXiv:1808.07371,\n\n2018.\n\n[8] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Uni\ufb01ed generative adversarial\nnetworks for multi-domain image-to-image translation. In IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), 2018.\n\n[9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and\nB. Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on\nComputer Vision and Pattern Recognition (CVPR), 2016.\n\n[10] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. 
In\n\nAdvances in Neural Information Processing Systems (NIPS), 2017.\n\n[11] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video\n\nprediction. In Advances in Neural Information Processing Systems (NIPS), 2016.\n\n[12] O. Gafni, L. Wolf, and Y. Taigman. Vid2game: Controllable characters extracted from real-world videos.\n\narXiv preprint arXiv:1904.08379, 2019.\n\n[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\n\nGenerative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein\n\nGANs. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[15] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. In International Conference on Learning Representations\n\n(ICLR), 2016.\n\n[16] Z. Hao, X. Huang, and S. Belongie. Controllable video generation with sparse trajectories. In IEEE\n\nConference on Computer Vision and Pattern Recognition (CVPR), 2018.\n\n[17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale\nupdate rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems\n(NIPS), 2017.\n\n[18] Q. Hu, A. Waelchli, T. Portenier, M. Zwicker, and P. Favaro. Video synthesis from a single image and\n\nmotion stroke. arXiv preprint arXiv:1812.01874, 2018.\n\n[19] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In\n\nIEEE International Conference on Computer Vision (ICCV), 2017.\n\n[20] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The ApolloScape dataset\nfor autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.\n[21] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation.\n\nEuropean Conference on Computer Vision (ECCV), 2018.\n\n[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial\n\nnetworks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[23] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic \ufb01lter networks. In Advances in Neural\n\nInformation Processing Systems (NIPS), 2016.\n\n[24] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu.\n\nVideo pixel networks. arXiv preprint arXiv:1610.00527, 2016.\n\n[25] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks.\n\nIn IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\n\n[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on\n\nLearning Representations (ICLR), 2015.\n\narXiv preprint arXiv:1804.01523, 2018.\n\n[27] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction.\n\n[28] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Flow-grounded spatial-temporal video prediction\nfrom still images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 600\u2013615,\n2018.\n\n[29] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion GAN for future-\ufb02ow embedded video prediction.\n\nIn Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[30] M.-Y. Liu, T. Breuel, and J. 
Kautz. Unsupervised image-to-image translation networks. In Advances in\n\nNeural Information Processing Systems (NIPS), 2017.\n\n10\n\n\f[31] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised\n\nimage-to-image translation. arXiv preprint arXiv:1905.01723, 2019.\n\n[32] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information\n\nProcessing Systems (NIPS), 2016.\n\n[33] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised\n\nlearning. In International Conference on Learning Representations (ICLR), 2017.\n\n[34] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In\n\nInternational Conference on Learning Representations (ICLR), 2016.\n\n[35] T. Miyato and M. Koyama. cGANs with projection discriminator. In International Conference on Learning\n\nRepresentations (ICLR), 2018.\n\nVision (ECCV), 2018.\n\n[36] N. Neverova, R. Alp Guler, and I. Kokkinos. Dense pose transfer. In European Conference on Computer\n\n[37] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In IEEE Conference on\n\nComputer Vision and Pattern Recognition (CVPR), 2017.\n\n[38] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In IEEE\n\nInternational Conference on Computer Vision (ICCV), 2017.\n\n[39] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classi\ufb01er GANs.\n\nIn\n\nInternational Conference on Machine Learning (ICML), 2017.\n\n[40] J. Pan, C. Wang, X. Jia, J. Shao, L. Sheng, J. Yan, and X. Wang. Video generation from single semantic\n\nlabel map. arXiv preprint arXiv:1903.04480, 2019.\n\n[41] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normal-\n\nization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\n\n[42] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional\ngenerative adversarial networks. In International Conference on Learning Representations (ICLR), 2015.\n[43] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image\n\nsynthesis. In International Conference on Machine Learning (ICML), 2016.\n\n[44] A. R\u00f6ssler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nie\u00dfner. Faceforensics: A large-scale\n\nvideo dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018.\n\n[45] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping.\n\nIn IEEE International Conference on Computer Vision (ICCV), 2017.\n\n[46] A. Shrivastava, T. P\ufb01ster, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and\nunsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), 2017.\n\n[47] A. Siarohin, S. Lathuili\u00e8re, S. Tulyakov, E. Ricci, and N. Sebe. Animating arbitrary objects via deep\n\nmotion transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\n\n[48] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using\n\nlstms. In International Conference on Machine Learning (ICML), 2015.\n\n[49] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz. Pixel-adaptive convolutional neural\n\nnetworks. 
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\n\n[50] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In International\n\nConference on Learning Representations (ICLR), 2017.\n\n[51] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video\n\ngeneration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.\n\n[52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin.\n\nAttention is all you need. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[53] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video\n\nsequence prediction. In International Conference on Learning Representations (ICLR), 2017.\n\n[54] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in\n\nNeural Information Processing Systems (NIPS), 2016.\n\n[55] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using\n\nvariational autoencoders. In European Conference on Computer Vision (ECCV), 2016.\n\n[56] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose\n\nfutures. In IEEE International Conference on Computer Vision (ICCV), 2017.\n\n[57] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In\n\nAdvances in Neural Information Processing Systems (NIPS), 2018.\n\n[58] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis\nand semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), 2018.\n\n[59] J. Wu, D. Li, Y. Yang, C. Bajaj, and X. Ji. Dynamic sampling convolutional neural networks. In European\n\nConference on Computer Vision (ECCV), 2018.\n\n[60] T. Wu, S. Tang, R. Zhang, and Y. Zhang. Cgnet: A light-weight context guided network for semantic\n\nsegmentation. arXiv preprint arXiv:1811.08201, 2018.\n\n[61] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend\nand tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.\n[62] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image\ngeneration with attentional generative adversarial networks. In IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), 2018.\n\n11\n\n\fConference on Learning Representations (ICLR), 2019.\n\n[65] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In\n\nInternational Conference on Machine Learning (ICML), 2019.\n\n[66] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic\nimage synthesis with stacked generative adversarial networks. In IEEE International Conference on\nComputer Vision (ICCV), 2017.\n\n[67] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg. Dance dance generation: Motion transfer for internet\n\n[63] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via\n\ncross convolutional networks. In Advances in Neural Information Processing Systems (NIPS), 2016.\n\n[64] C. Zhang, M. Ren, and R. Urtasun. Graph hypernetworks for neural architecture search. In International\n\nvideos. 
arXiv preprint arXiv:1904.00129, 2019.\n\n[68] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent\n\nadversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.\n\n[69] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal\n\nimage-to-image translation. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n12\n\n\f", "award": [], "sourceid": 2765, "authors": [{"given_name": "Ting-Chun", "family_name": "Wang", "institution": "NVIDIA"}, {"given_name": "Ming-Yu", "family_name": "Liu", "institution": "Nvidia Research"}, {"given_name": "Andrew", "family_name": "Tao", "institution": "Nvidia Corporation"}, {"given_name": "Guilin", "family_name": "Liu", "institution": "NVIDIA"}, {"given_name": "Bryan", "family_name": "Catanzaro", "institution": "NVIDIA"}, {"given_name": "Jan", "family_name": "Kautz", "institution": "NVIDIA"}]}