{"title": "Temporal Coherency based Criteria for Predicting Video Frames using Deep Multi-stage Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4268, "page_last": 4277, "abstract": "Predicting the future from a sequence of video frames has been recently a sought after yet challenging task in the field of computer vision and machine learning. Although there have been efforts for tracking using motion trajectories and flow features, the complex problem of generating unseen frames has not been studied extensively. In this paper, we deal with this problem using convolutional models within a multi-stage Generative Adversarial Networks (GAN) framework. The proposed method uses two stages of GANs to generate a crisp and clear set of future frames. Although GANs have been used in the past for predicting the future, none of the works consider the relation between subsequent frames in the temporal dimension. Our main contribution lies in formulating two objective functions based on the Normalized Cross Correlation (NCC) and the Pairwise Contrastive Divergence (PCD) for solving this problem. This method, coupled with the traditional L1 loss, has been experimented with three real-world video datasets, viz. Sports-1M, UCF-101 and the KITTI. 
Performance analysis reveals superior results over the recent state-of-the-art methods.", "full_text": "Temporal Coherency based Criteria for Predicting Video Frames using Deep Multi-stage Generative Adversarial Networks

Prateep Bhattacharjee1, Sukhendu Das2
Visualization and Perception Laboratory
Department of Computer Science and Engineering
Indian Institute of Technology Madras, Chennai, India
1prateepb@cse.iitm.ac.in, 2sdas@iitm.ac.in

Abstract

Predicting the future from a sequence of video frames has recently been a sought-after yet challenging task in the field of computer vision and machine learning. Although there have been efforts at tracking using motion trajectories and flow features, the complex problem of generating unseen frames has not been studied extensively. In this paper, we deal with this problem using convolutional models within a multi-stage Generative Adversarial Networks (GAN) framework. The proposed method uses two stages of GANs to generate a crisp and clear set of future frames. Although GANs have been used in the past for predicting the future, none of the works consider the relation between subsequent frames in the temporal dimension. Our main contribution lies in formulating two objective functions based on the Normalized Cross Correlation (NCC) and the Pairwise Contrastive Divergence (PCD) for solving this problem. This method, coupled with the traditional L1 loss, has been evaluated on three real-world video datasets, viz. Sports-1M, UCF-101 and the KITTI. Performance analysis reveals superior results over the recent state-of-the-art methods.

1 Introduction

Video frame prediction has recently been a popular problem in computer vision as it caters to a wide range of applications including self-driving cars, surveillance, robotics and in-painting. 
However, the challenge lies in the fact that real-world scenes tend to be complex, and predicting future events requires modeling of complicated internal representations of the ongoing events. Past approaches to video frame prediction include the use of recurrent neural architectures [19], Long Short Term Memory [8] networks [22] and action-conditional deep networks [17]. Recently, the work of [14] modeled the frame prediction problem in the framework of Generative Adversarial Networks (GAN). Generative models, as introduced by Goodfellow et al. [5], try to generate images from random noise by simultaneously training a generator (G) and a discriminator network (D) in a process similar to a zero-sum game. Mathieu et al. [14] show the effectiveness of this adversarial training in the domain of frame prediction using a combination of two objective functions (along with the basic adversarial loss) employed on a multi-scale generator network. This idea stems from the fact that the original L2-loss tends to produce blurry frames. This was overcome by the use of the Gradient Difference Loss (GDL) [14], which showed significant improvement over past approaches when compared using similarity and sharpness measures. However, this approach, although producing satisfying results for the first few predicted frames, tends to generate blurry results for predictions far away (~6) in the future.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: The proposed multi-stage GAN framework. The stage-1 generator network produces a low-resolution version of the predicted frames, which are then fed to the stage-2 generator. 
Discriminators at both the stages predict 0 or 1 for each predicted frame to denote its origin: synthetic or original.

In this paper, we aim to get over this hurdle of blurry predictions by considering an additional constraint between consecutive frames in the temporal dimension. We propose two objective functions: (a) Normalized Cross-Correlation Loss (NCCL) and (b) Pairwise Contrastive Divergence Loss (PCDL) for effectively capturing the inter-frame relationships in the GAN framework. NCCL maximizes the cross-correlation between neighborhood patches from consecutive frames, whereas PCDL applies a penalty when subsequent generated frames are predicted wrongly by the discriminator network (D), thereby separating them far apart in the feature space. Performance analysis over three real-world video datasets shows the effectiveness of the proposed loss functions in predicting future frames of a video.

The rest of the paper is organized as follows: section 2 describes the multi-stage generative adversarial architecture; sections 3 - 6 introduce the different loss functions employed: the adversarial loss (AL) and, most importantly, NCCL and PCDL. We show the results of our experiments on Sports-1M [10], UCF-101 [21] and KITTI [4] and compare them with state-of-the-art techniques in section 7. Finally, we conclude the paper by highlighting the key points and future directions of research in section 8.

2 Multi-stage Generative Adversarial Model

Generative Adversarial Networks (GAN) [5] are composed of two networks: (a) the Generator (G) and (b) the Discriminator (D). The generator G tries to generate realistic images by learning to model the true data distribution pdata, thereby making the task of differentiating between original and generated images difficult for the discriminator. The discriminator D, on the other hand, is optimized to distinguish between the synthetic and the real images. 
In essence, this procedure of alternate learning is similar to a two-player min-max game [5]. Overall, the GANs minimize the following objective function:

min_G max_D v(D, G) = E_{x ~ p_data}[log(D(x))] + E_{z ~ p_z}[log(1 - D(G(z)))]    (1)

where x is a real image from the true distribution p_data and z is a vector sampled from the distribution p_z, usually uniform or Gaussian. The adversarial loss employed in this paper is a variant of that in equation 1, as the input to our network is a sequence of video frames instead of a vector z.

As convolutions account only for short-range relationships, pooling layers are used to garner information from a wider range. But this process generates low-resolution images. To overcome this, Mathieu et al. [14] use a multi-scale generator network, equivalent to the reconstruction process of a Laplacian pyramid [18], coupled with discriminator networks to produce high-quality output frames of size 32 x 32. There are two shortcomings of this approach:

a. Generating image output at higher dimensions, viz. (128 x 128) or (256 x 256), requires multiple upsampling operations applied on the output of the generators. In our proposed model, this upsampling is handled implicitly by the generator networks themselves through the use of consecutive unpooling operations, thereby generating predicted frames at much higher resolution with fewer scales.

b. As the generator network parameters are not learned with respect to any objective function which captures the temporal relationship effectively, the output becomes blurry after ~4 frames.

To overcome the first issue, we propose a multi-stage (2-stage) generative adversarial network (MS-GAN).

2.1 Stage-1

Generating the output frame(s) directly often produces blurry outcomes. 
Instead, we simplify the process by first generating a crude, low-resolution version of the frame(s) to be predicted. The stage-1 generator (G1) consists of a series of convolutional layers coupled with unpooling layers [25] which upsample the frames. We used ReLU non-linearity in all but the last layer, in which case hyperbolic tangent (tanh) was used, following the scheme of [18]. The inputs to G1 are m consecutive frames of dimension W0 x H0, whereas the outputs are n predicted frames of size W1 x H1, where W1 = W0 x 2 and H1 = H0 x 2. These outputs, stacked with the upsampled version of the original input frames, produce the input of dimension (m + n) x W1 x H1 for the stage-1 discriminator (D1). D1 applies a chain of convolutional layers followed by multiple fully-connected layers to finally produce an output vector of dimension (m + n), consisting of 0's and 1's.

One of the key differences of our proposed GAN framework from the conventional one [5] is that the discriminator network produces a decision output for multiple frames, instead of a single 0/1 outcome. This is exploited by one of the proposed objective functions, the PCDL, which is described later in section 4.

2.2 Stage-2

The second stage network closely resembles the stage-1 architecture, differing only in the input and output dimensions. The input to the stage-2 generator (G2) is formed by stacking the predicted frames and the upsampled inputs of G1, thereby having dimension (m + n) x W1 x H1. The outputs of G2 are n predicted high-resolution frames of size W2 x H2, where W2 = W1 x 4 and H2 = H1 x 4. 
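To make the dimension bookkeeping of the two stages concrete, the following is a minimal sketch; the helper name `stage_shapes` and the returned keys are ours, not the paper's:

```python
# Hypothetical bookkeeping helper (names are ours) for the frame dimensions
# flowing through the two generator stages described in sections 2.1-2.2.
def stage_shapes(w0, h0, m, n):
    """m input frames at W0 x H0; per-stage resolutions follow the text."""
    w1, h1 = 2 * w0, 2 * h0        # stage-1 predicts at twice the input size
    w2, h2 = 4 * w1, 4 * h1        # stage-2 upsamples by a further factor of 4
    return {
        "g1_out": (n, w1, h1),     # stage-1 predicted frames
        "d1_in": (m + n, w1, h1),  # predictions stacked with upsampled inputs
        "g2_in": (m + n, w1, h1),  # the same stack feeds the stage-2 generator
        "g2_out": (n, w2, h2),     # high-resolution predicted frames
    }
```

With the paper's 4-in/4-out configuration and 32 x 32 inputs, stage-2 emits four 256 x 256 frames.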
The stage-2 discriminator (D2) works in a similar fashion as D1, producing an output vector of length (m + n).

Effectively, the multi-stage model can be represented by the following recursive equations:

Y_hat_k = G_k(Y_hat_{k-1}, X_{k-1})   for k >= 2
Y_hat_k = G_k(X_{k-1})                for k = 1        (2)

where Y_hat_k is the set of predicted frames and X_k are the input frames at the k-th stage of the generator network G_k.

2.3 Training the multi-stage GAN

The training procedure of the multi-stage GAN model follows that of the original generative adversarial networks with minor variations. The training of the discriminator and the generator are described as follows:

Training of the discriminator.  Considering the input to the discriminator (D) as X (series of m frames) and the target output to be Y (series of n frames), D is trained to distinguish between synthetic and original inputs by classifying (X, Y) into class 1 and (X, G(X)) into class 0. Hence, for each of the k stages, we train D with an all-ones target vector (of dimension m) for (X, Y) and an all-zeros target vector (of dimension n) for (X, G(X)). 
The loss function for training D is:

L^D_adv = sum_{k=1}^{N_stages} [ L_bce(D_k(X_k, Y_k), 1) + L_bce(D_k(X_k, G_k(X_k)), 0) ]    (3)

where 1 and 0 denote the all-ones and all-zeros target vectors and L_bce, the binary cross-entropy loss, is defined as:

L_bce(A, A') = - sum_{i=1}^{|A|} [ A'_i log(A_i) + (1 - A'_i) log(1 - A_i) ],   A_i in [0, 1], A'_i in {0, 1}    (4)

where A and A' are the discriminator outputs and the targets respectively.

Training of the generator.  We perform an optimization step on the generator network (G), keeping the weights of D fixed, by feeding a set of consecutive frames X sampled from the training data with target Y (set of ground-truth output frames) and minimizing the following adversarial loss:

L^G_adv(X) = sum_{k=1}^{N_stages} L_bce(D_k(X_k, G_k(X_k)), 1)    (5)

By minimizing the above two loss criteria (eqns. 3, 5), G makes the discriminator believe that the source of the generated frames is the input data space itself. Although this process of alternate optimization of D and G is a reasonably well-designed formulation, for practical purposes it produces an unstable system where G generates samples that consecutively move far away from the original input space and, in consequence, D distinguishes them easily. To overcome this instability inherent in the GAN principle and the issue of producing blurry frames described in section 2, we formulate a pair of objective criteria: (a) Normalized Cross Correlation Loss (NCCL) and (b) Pairwise Contrastive Divergence Loss (PCDL), to be used along with the established adversarial loss (refer eqns. 3 and 5).

3 Normalized Cross-Correlation Loss (NCCL)

The main advantage of video over image data is the fact that it offers a far richer space of data distribution by adding the temporal dimension along with the spatial one. 
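The single-stage adversarial criteria of eqns. 3-5 can be sketched in NumPy as follows; `d_real` and `d_fake` stand in for the discriminator's per-frame outputs on (X, Y) and (X, G(X)), and the function names are ours:

```python
import numpy as np

# Minimal sketch of the adversarial criteria (eqns. 3-5) for a single stage.
def bce(outputs, targets, eps=1e-7):
    """Binary cross-entropy of eq. 4, summed over the output vector."""
    outputs = np.clip(outputs, eps, 1 - eps)  # avoid log(0)
    return -np.sum(targets * np.log(outputs)
                   + (1 - targets) * np.log(1 - outputs))

def d_loss(d_real, d_fake):
    """Eq. 3 (one stage): push real decisions toward 1, generated toward 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def g_loss(d_fake):
    """Eq. 5 (one stage): G is rewarded when D labels its frames as real."""
    return bce(d_fake, np.ones_like(d_fake))
```

Summing these per-stage terms over k = 1 ... N_stages gives the full objectives.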
Convolutional Neural Networks (CNN) can only capture short-range relationships, a small part of the vast available information, from the input video data, and only in the spatial domain. Although this can be somewhat alleviated by the use of 3D convolutions [9], that increases the number of learnable parameters immensely. Normalized cross-correlation has long been used in the field of video analytics [1, 2, 16, 13, 23] to model the spatial and temporal relationships present in the data.

Normalized cross-correlation (NCC) measures the similarity of two image patches as a function of the displacement of one relative to the other. This can be mathematically defined as:

NCC(f, g) = sum_{x,y} (f(x, y) - mu_f)(g(x, y) - mu_g) / (sigma_f sigma_g)    (6)

where f(x, y) is a sub-image, g(x, y) is the template to be matched, mu_f and mu_g denote the means of the sub-image and the template respectively, and sigma_f and sigma_g denote their standard deviations.

In the domain of video frame(s) prediction, we incorporate the NCC by first extracting small non-overlapping square patches of size h x h (1 < h <= 4), denoted by a 3-tuple P_t{x, y, h}, where x and y are the co-ordinates of the top-left pixel of a particular patch, from the predicted frame at time t, and then calculating the cross-correlation score with the patch extracted from the ground-truth frame at time (t - 1), represented by P_hat_{t-1}{x - 2, y - 2, h + 4}.

In simpler terms, we estimate the cross-correlation score between a small portion of the current predicted frame and the local neighborhood of that portion in the previous ground-truth frame. We assume that the motion features present in the entire scene (frame) can be effectively approximated by adjacent spatial blocks of lower resolution, using small local neighborhoods in the temporal dimension. 
This stems from the fact that, unless the video contains significant jitter or unexpected random events like scene change, the motion features remain smooth over time.

Algorithm 1: Normalized cross-correlation score for estimating similarity between a set of predicted frame(s) and a set of ground-truth frame(s).

Input: Ground-truth frames (GT), Predicted frames (PRED)
Output: Cross-correlation score (Score_NCC)
// h = height and width of an image patch
// H = height and width of the predicted frames
// t = current time
// T = number of frames predicted
Initialize: Score_NCC = 0;
for t = 1 to T do
    for i = 0 to H, i <- i + h do
        for j = 0 to H, j <- j + h do
            P_t <- extract_patch(PRED_t, i, j, h);
                /* extracts a patch from the predicted frame at time t of dimension h x h, starting from the top-left pixel index (i, j) */
            P_hat_{t-1} <- extract_patch(GT_{t-1}, i - 2, j - 2, h + 4);
                /* extracts a patch from the ground-truth frame at time (t - 1) of dimension (h + 4) x (h + 4), starting from the top-left pixel index (i - 2, j - 2) */
            mu_{P_t} <- avg(P_t);  mu_{P_hat_{t-1}} <- avg(P_hat_{t-1});
            sigma_{P_t} <- standard_deviation(P_t);  sigma_{P_hat_{t-1}} <- standard_deviation(P_hat_{t-1});
            Score_NCC <- Score_NCC + max(0, sum_{x,y} (P_t(x, y) - mu_{P_t})(P_hat_{t-1}(x, y) - mu_{P_hat_{t-1}}) / (sigma_{P_t} sigma_{P_hat_{t-1}}));
        end
    end
    Score_NCC <- Score_NCC / floor(H/h)^2;    // average over all the patches
end
Score_NCC <- Score_NCC / (T - 1);    // average over all the frames
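The scoring loop of algorithm 1 can be sketched in NumPy as below. For simplicity the sketch correlates equal-sized h x h patches taken at the same location, whereas the paper matches an h x h patch against a larger (h + 4) x (h + 4) neighborhood; that restriction is our simplifying assumption:

```python
import numpy as np

# Minimal sketch of Algorithm 1's scoring loop. Simplifying assumption: the
# predicted patch is correlated against the same-size, same-location patch of
# the previous ground-truth frame (no enlarged search neighborhood).
def ncc_patch(p, q, eps=1e-8):
    """Normalized cross-correlation (eq. 6) between two equal-size patches."""
    pz = (p - p.mean()) / (p.std() + eps)
    qz = (q - q.mean()) / (q.std() + eps)
    return np.sum(pz * qz)

def ncc_score(gt, pred, h=4):
    """Average max(0, NCC) over non-overlapping h x h patch pairs.

    gt, pred: arrays of shape (T, H, H); frame t of `pred` is matched
    against frame t-1 of `gt`, as in Algorithm 1.
    """
    T, H, _ = pred.shape
    score = 0.0
    for t in range(1, T):
        frame_score = 0.0
        for i in range(0, H, h):
            for j in range(0, H, h):
                p = pred[t, i:i + h, j:j + h]
                q = gt[t - 1, i:i + h, j:j + h]
                frame_score += max(0.0, ncc_patch(p, q))
        score += frame_score / (H // h) ** 2   # average over all the patches
    return score / (T - 1)                     # average over all the frames
```

For a perfectly predicted frame each h x h patch contributes h^2 (the sum of squared z-scores), which is the maximum attainable per-patch score.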
The step-by-step process for finding the cross-correlation score by matching local patches of predicted and ground-truth frames is described in algorithm 1.

The idea of calculating the NCC score is modeled into an objective function for the generator network G, where it tries to maximize the score over a batch of inputs. In essence, this objective function models the temporal data distribution by smoothing the local motion features generated by the convolutional model. This loss function, L_NCCL, is defined as:

L_NCCL(Y, Y_hat) = -Score_NCC(Y, Y_hat)    (7)

where Y and Y_hat are the ground-truth and predicted frames, and Score_NCC is the average normalized cross-correlation score over all the frames, obtained using the method described in algorithm 1. The generator tries to minimize L_NCCL along with the adversarial loss defined in section 2.

We also propose a variant of this objective function, termed Smoothed Normalized Cross-Correlation Loss (SNCCL), where the patch-similarity logic of NCCL is extended by convolving with Gaussian filters to suppress transient (sudden) motion patterns. A detailed discussion of this algorithm is given in sec. A of the supplementary document.

4 Pairwise Contrastive Divergence Loss (PCDL)

As discussed in sec. 3, the proposed method captures motion features that vary slowly over time. The NCCL criterion aims to achieve this using local similarity measures. To complement this on a global scale, we use the idea of pairwise contrastive divergence over the input frames. 
The idea of exploiting this temporal coherence for learning motion features has been studied in the recent past [6, 7, 15].

By assuming that motion features vary slowly over time, we describe Y_hat_t and Y_hat_{t+1} as a temporal pair, where Y_hat_t and Y_hat_{t+1} are the predicted frames at times t and (t + 1) respectively, if the outputs of the discriminator network D for both these frames are 1. With this notation, we model the slowness principle of the motion features using an objective function as:

L_PCDL(Y_hat, p) = sum_{i=0}^{T-1} D_delta(Y_hat_i, Y_hat_{i+1}, p_i x p_{i+1})
                 = sum_{i=0}^{T-1} [ p_i x p_{i+1} x d(Y_hat_i, Y_hat_{i+1}) + (1 - p_i x p_{i+1}) x max(0, delta - d(Y_hat_i, Y_hat_{i+1})) ]    (8)

where T is the time-duration of the frames predicted, p_i is the output decision (p_i in {0, 1}) of the discriminator, d(x, y) is a distance measure (L2 in this paper) and delta is a positive margin. In simpler terms, equation 8 minimizes the distance between frames that have been predicted correctly and encourages the distance in the negative case, up to a margin delta.

5 Higher Order Pairwise Contrastive Divergence Loss

The Pairwise Contrastive Divergence Loss (PCDL) discussed in the previous section takes into account (dis)similarities between two consecutive frames to bring them further apart (or closer) in the spatio-temporal feature space. This idea can be extended to higher-order situations involving three or more consecutive frames. 
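The pairwise loss of equation 8 can be sketched as follows, with the Euclidean (L2) norm as the distance measure d and the discriminator decisions passed in as a 0/1 vector; the function name is ours:

```python
import numpy as np

# Minimal sketch of the pairwise contrastive divergence loss (eq. 8).
# frames: predicted frames (first axis = time); p: discriminator 0/1 decisions.
def pcdl(frames, p, delta=1.0):
    """Pull correctly-predicted neighbours together, push others apart."""
    loss = 0.0
    for i in range(len(frames) - 1):
        d = np.linalg.norm(frames[i] - frames[i + 1])  # L2 distance measure
        pij = p[i] * p[i + 1]                          # 1 only if both frames fooled D
        loss += pij * d + (1 - pij) * max(0.0, delta - d)
    return loss
```

When both decisions are 1 the pair contributes its distance (to be shrunk); otherwise the hinge term pushes the pair apart up to the margin delta.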
For n = 3, where n is the number of consecutive frames considered, PCDL can be defined as:

L_{3-PCDL} = sum_{i=0}^{T-2} D_delta(|Y_hat_i - Y_hat_{i+1}|, |Y_hat_{i+1} - Y_hat_{i+2}|, p_{i,i+1,i+2})
           = sum_{i=0}^{T-2} [ p_{i,i+1,i+2} x d(|Y_hat_i - Y_hat_{i+1}|, |Y_hat_{i+1} - Y_hat_{i+2}|) + (1 - p_{i,i+1,i+2}) x max(0, delta - d(|Y_hat_i - Y_hat_{i+1}|, |Y_hat_{i+1} - Y_hat_{i+2}|)) ]    (9)

where p_{i,i+1,i+2} = 1 only if p_i, p_{i+1} and p_{i+2} are all simultaneously 1, i.e., the discriminator is very sure that the predicted frames are from the original data distribution. All the other symbols bear the standard representations defined in the paper.

This version of the objective function, in essence, shrinks the distance between the predicted frames occurring sequentially in a temporal neighborhood, thereby increasing their similarity and maintaining the temporal coherency.

6 Combined Loss

Finally, we combine the objective functions given in eqns. 5 - 8 along with the general L1-loss with different weights as:

L_Combined = lambda_adv L^G_adv(X) + lambda_L1 L_L1(X, Y) + lambda_NCCL L_NCCL(Y, Y_hat) + lambda_PCDL L_PCDL(Y_hat, p) + lambda_{3-PCDL} L_{3-PCDL}(Y_hat, p)    (10)

All the weights, viz. lambda_L1, lambda_NCCL, lambda_PCDL and lambda_{3-PCDL}, have been set to 0.25, while lambda_adv equals 0.01. This overall loss is minimized during the training stage of the multi-stage GAN using the Adam optimizer [11].

We also evaluate our models by incorporating another loss function described in section A of the supplementary document, the Smoothed Normalized Cross-Correlation Loss (SNCCL). The weight for SNCCL, lambda_SNCCL, equals 0.33, while lambda_{3-PCDL} and lambda_PCDL are kept at 0.16.

7 Experiments

Performance analysis of our proposed prediction model for video frame(s) has been done on video clips from the Sports-1M [10], UCF-101 [21] and KITTI [4] datasets. 
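The weighted combination of equation 10, with the weights quoted in section 6, can be sketched as follows; the individual loss terms are assumed to be computed elsewhere, and the dictionary keys are our shorthand:

```python
# Sketch of the weighted sum in eq. 10; weights follow the values quoted in
# section 6 (our shorthand names for the terms).
WEIGHTS = {"adv": 0.01, "l1": 0.25, "nccl": 0.25, "pcdl": 0.25, "pcdl3": 0.25}

def combined_loss(terms, weights=WEIGHTS):
    """terms: dict mapping a loss name to the scalar value of that criterion."""
    return sum(weights[name] * value for name, value in terms.items())
```

Swapping in the SNCCL configuration simply changes the weight table (0.33 for SNCCL, 0.16 for the two PCDL terms).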
The input-output configuration used for training the system is as follows: input: 4 frames and output: 4 frames. We compare our results with recent state-of-the-art methods using two popular metrics: (a) Peak Signal to Noise Ratio (PSNR) and (b) Structural Similarity Index Measure (SSIM) [24].

7.1 Datasets

Sports-1M.  A large collection of sports videos collected from YouTube, spread over 487 classes. The main reason for choosing this dataset is the amount of movement in the frames. Being a collection of sports videos, it has a sufficient amount of motion present in most of the frames, making it an efficient dataset for training the prediction model. Only this dataset has been used for training throughout our experimental studies.

UCF-101.  This dataset contains 13320 annotated videos belonging to 101 classes, with 180 frames/video on average. The frames in these videos do not contain as much movement as Sports-1M, and hence it is used only for testing purposes.

KITTI.  This consists of high-resolution video data from different road conditions. We have taken raw data from two categories: (a) city and (b) road.

7.2 Architecture of the network

Table 1: Network architecture details; G and D represent the generator and discriminator networks respectively. U denotes an unpooling operation which upsamples an input by a factor of 2.

                   Stage-1 (G)              Stage-2 (G)                         Stage-1 (D)    Stage-2 (D)
Feature maps       64, 128, 256U, 128, 64   64, 128U, 256, 512U, 256, 128, 64   64, 128, 256   128, 256, 512, 256, 128
Kernel sizes       5, 3, 3, 3, 5            5, 5, 5, 5, 5, 5, 5                 3, 5, 5        7, 5, 5, 5, 5
Fully connected    N/A                      N/A                                 1024, 512      1024, 512

The architecture details for the generator (G) and discriminator (D) networks used for the experimental studies are shown in table 1. All the convolutional layers except the terminal one in both stages of G are followed by ReLU non-linearity. 
The last layer uses the tanh activation function. In both stages of G, we use unpooling layers to upsample the image to a higher resolution by a factor of 2 in both dimensions (height and width). The learning rate is set to 0.003 for G, which is gradually decreased to 0.0004 over time. The discriminator (D) uses ReLU non-linearities and is trained with a learning rate of 0.03. We use mini-batches of 8 clips for training the overall network.

7.3 Evaluation metric for prediction

Assessment of the quality of the predicted frames is done by two methods: (a) Peak Signal to Noise Ratio (PSNR) and (b) Structural Similarity Index Measure (SSIM). PSNR measures the quality of the reconstruction process through the calculation of the mean-squared error between the original and the reconstructed signal on a logarithmic decibel scale [1]. SSIM is also an image similarity measure, where one of the images being compared is assumed to be of perfect quality [24].

As the frames in videos are composed of foreground and background, and in most cases the background is static (not the case in the KITTI dataset, as it has videos taken from a camera mounted on a moving car), we extract random sequences of 32 x 32 patches from the frames with significant motion. Calculation of motion is done using the optical flow method of Brox et al. [3].

7.4 Comparison

We compare the results on videos from UCF-101, using the model trained on the Sports-1M dataset. Table 2 demonstrates the superiority of our method over the most recent work [14]. We followed a similar choice of test set videos as in [14] to make a fair comparison. 
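The PSNR metric of section 7.3 follows the standard definition from the mean-squared error; a minimal sketch (not code from the paper), assuming pixel values in [0, peak]:

```python
import numpy as np

# Standard PSNR computation (a sketch, not the paper's code); `peak` is the
# maximum possible pixel value of the images being compared.
def psnr(original, predicted, peak=255.0):
    """Peak signal-to-noise ratio in dB, computed from the mean-squared error."""
    mse = np.mean((original.astype(np.float64) - predicted) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no noise
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher PSNR (and SSIM closer to 1) indicates a reconstruction closer to the ground truth, which is how the scores in table 2 are read.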
Notably, our model can produce acceptably good predictions even at the 4th frame, which is a significant result considering that [14] uses separate smaller multi-scale models for achieving this feat.

Figure 2: Qualitative results of using the proposed framework for predicting frames in UCF-101, with the three rows representing (a) Ground-truth, (b) Adv + L1 and (c) Combined (section 6) respectively. 'T' denotes the time-step. Figures in insets show zoomed-in patches for better visibility of areas involving motion (best viewed in color).

Also note that, even though the metrics for the first predicted frame do not differ by a large margin from the results of [14], for later frames the values decrease much more slowly for the models trained with the proposed objective functions (rows 8-10 of table 2). The main reason for this behavior of our proposed method is the incorporation of the temporal relations in the objective functions, rather than learning only in the spatial domain.

A similar trend was also found for the KITTI dataset. We could not find any prior work in the literature reporting findings on the KITTI dataset and hence compared only with several of our proposed models. In all cases, the performance gain with the inclusion of NCCL and PCDL is evident.

Finally, we show the prediction results obtained on both UCF-101 and KITTI in figures 2 and 3. It is evident from the sub-figures that our proposed objective functions produce frames of impressive quality, while the models trained with the L1 loss tend to output blurry reconstructions. 
The supplementary document contains visual results (shown in figures C.1-C.2) obtained when predicting frames far away from the current time-step (8 frames).

8 Conclusion

In this paper, we modified the Generative Adversarial Networks (GAN) framework with the use of unpooling operations and introduced two objective functions, based on the normalized cross-correlation (NCCL) and the contrastive divergence estimate (PCDL), to design an efficient algorithm for video frame(s) prediction. Studies show significant improvement of the proposed methods over recently published works. Our proposed objective functions can be used with more complex networks involving 3D convolutions and recurrent neural networks. In the future, we aim to learn weights for the cross-correlation such that it focuses adaptively on areas involving varying amounts of motion.

Table 2: Comparison of performance of different methods using PSNR/SSIM scores for the UCF-101 and KITTI datasets. The first five rows report the results from [14]. (*) indicates models fine-tuned on patches of size 64 x 64 [14]. (-) denotes unavailability of data. GDL stands for Gradient Difference Loss [14]. SNCCL is discussed in section A of the supplementary document. 
Best results in bold.

Methods                          1st frame (UCF / KITTI)    2nd frame (UCF / KITTI)    4th frame (UCF / KITTI)
L1                               28.7/0.88    -             23.8/0.83    -             -            -
GDL L1                           29.4/0.90    -             24.9/0.84    -             -            -
GDL L1*                          29.9/0.90    -             26.4/0.87    -             -            -
Adv + GDL fine-tuned*            32.0/0.92    -             28.9/0.89    -             -            -
Optical flow                     31.6/0.93    -             28.2/0.90    -             -            -
Next-flow [20]                   31.9/-       -             -            -             -            -
Deep Voxel Flow [12]             35.8/0.96    -             -            -             -            -
Adv + NCCL + L1                  35.4/0.94    37.1/0.91     33.9/0.92    35.4/0.90     28.7/0.75    27.8/0.75
Combined                         37.3/0.95    39.7/0.93     35.7/0.92    37.1/0.91     30.2/0.76    29.6/0.76
Combined + SNCCL                 38.2/0.95    40.2/0.94     36.8/0.93    37.7/0.91     30.9/0.77    30.4/0.77
Combined + SNCCL (full frame)    37.3/0.94    39.4/0.94     35.1/0.91    36.4/0.91     29.5/0.75    29.1/0.76

Figure 3: Qualitative results of using the proposed framework for predicting frames in the KITTI dataset, for (a) L1, (b) NCCL (section 3), (c) Combined (section 6) and (d) ground-truth (best viewed in color).

References
[1] A. C. Bovik. The Essential Guide to Video Processing. Academic Press, 2nd edition, 2009.
[2] K. Briechle and U. D. Hanebeck. Template matching using fast normalized cross correlation. In Aerospace/Defense Sensing, Simulation, and Controls, pages 95-102. International Society for Optics and Photonics, 2001.
[3] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(3):500-513, 2011.
[4] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672-2680, 2014.
[6] R. Goroshin, J. 
Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4086-4093, 2015.
[7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735-1742, 2006.
[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[9] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(1):221-231, 2013.
[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. arXiv preprint arXiv:1702.02463, 2017.
[13] J. Luo and E. E. Konofagou. A fast normalized cross-correlation calculation method for motion estimation. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 57(6):1347-1357, 2010.
[14] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
[15] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 737-744. ACM, 2009.
[16] A. Nakhmani and A. Tannenbaum. 
A new distance measure based on generalized image normalized cross-correlation for robust video tracking and image recognition. Pattern Recognition Letters (PRL), 34(3):315-321, 2013.
[17] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems (NIPS), pages 2863-2871, 2015.
[18] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[19] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[20] N. Sedaghat. Next-flow: Hybrid multi-tasking with next-frame prediction to boost optical-flow estimation in the wild. arXiv preprint arXiv:1612.03777, 2016.
[21] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[22] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning (ICML), pages 843-852, 2015.
[23] A. Subramaniam, M. Chatterjee, and A. Mittal. Deep neural networks with inexact matching for person re-identification. In Advances in Neural Information Processing Systems (NIPS), pages 2667-2675, 2016.
[24] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4):600-612, 2004.
[25] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818-833. 
Springer, 2014.
", "award": [], "sourceid": 2237, "authors": [{"given_name": "Prateep", "family_name": "Bhattacharjee", "institution": "Indian Institute of Technology Madras"}, {"given_name": "Sukhendu", "family_name": "Das", "institution": "IIT Madras"}]}