{"title": "Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution", "book": "Advances in Neural Information Processing Systems", "page_first": 235, "page_last": 243, "abstract": "Super resolving a low-resolution video is usually handled by either single-image super-resolution (SR) or multi-frame SR. Single-Image SR deals with each video frame independently, and ignores intrinsic temporal dependency of video frames which actually plays a very important role in video super-resolution. Multi-Frame SR generally extracts motion information, e.g. optical flow, to model the temporal dependency, which often shows high computational cost. Considering that recurrent neural network (RNN) can model long-term contextual information of temporal sequences well, we propose a bidirectional recurrent convolutional network for efficient multi-frame SR.Different from vanilla RNN, 1) the commonly-used recurrent full connections are replaced with weight-sharing convolutional connections and 2) conditional convolutional connections from previous input layers to current hidden layer are added for enhancing visual-temporal dependency modelling. With the powerful temporal dependency modelling, our model can super resolve videos with complex motions and achieve state-of-the-art performance. 
Due to the cheap convolution operations, our model has low computational complexity and runs orders of magnitude faster than other multi-frame methods.", "full_text": "Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution\n\nYan Huang1    Wei Wang1    Liang Wang1,2\n\n1Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition\n2Center for Excellence in Brain Science and Intelligence Technology\nInstitute of Automation, Chinese Academy of Sciences\n{yhuang, wangwei, wangliang}@nlpr.ia.ac.cn\n\nAbstract\n\nSuper resolving a low-resolution video is usually handled by either single-image super-resolution (SR) or multi-frame SR. Single-image SR deals with each video frame independently, ignoring the intrinsic temporal dependency among video frames that actually plays a very important role in video super-resolution. Multi-frame SR generally extracts motion information, e.g., optical flow, to model the temporal dependency, which often incurs a high computational cost. Considering that recurrent neural networks (RNNs) can model long-term contextual information of temporal sequences well, we propose a bidirectional recurrent convolutional network for efficient multi-frame SR. Different from vanilla RNNs, 1) the commonly-used recurrent full connections are replaced with weight-sharing convolutional connections, and 2) conditional convolutional connections from previous input layers to the current hidden layer are added to enhance visual-temporal dependency modelling. With this powerful temporal dependency modelling, our model can super resolve videos with complex motions and achieve state-of-the-art performance. 
Due to the cheap convolution operations, our model has low computational complexity and runs orders of magnitude faster than other multi-frame methods.\n\n1 Introduction\n\nSince large numbers of high-definition displays have sprung up, generating high-resolution videos from previous low-resolution content, namely video super-resolution (SR), is in great demand. Recently, various methods have been proposed to handle this problem, which can be classified into two categories: 1) single-image SR [10, 5, 9, 8, 12, 25, 23], which super resolves each video frame independently, and 2) multi-frame SR [13, 17, 3, 2, 14], which models and exploits the temporal dependency among video frames, usually considered an essential component of video SR.\nExisting multi-frame SR methods generally model the temporal dependency by extracting subpixel motions of video frames, e.g., estimating optical flow based on sparse prior integration or variational regularity [2, 14, 13]. But such accurate motion estimation can only be effective for video sequences that contain small motions. In addition, the high computational cost of these methods limits their real-world applications. Several solutions have been explored to overcome these issues by avoiding explicit motion estimation [21, 16]. Unfortunately, they still have to perform implicit motion estimation to reduce temporal aliasing and achieve resolution enhancement when large motions are encountered.\nGiven the fact that recurrent neural networks (RNNs) can well model long-term contextual information for video sequences, we propose a bidirectional recurrent convolutional network (BRCN) to efficiently learn the temporal dependency for multi-frame SR. The proposed network exploits three convolutions. 1) Feedforward convolution models visual spatial dependency between a low-resolution frame and its high-resolution result. 
2) Recurrent convolution connects the hidden layers of successive frames to learn temporal dependency; different from the commonly-used full recurrent connection in vanilla RNNs, it is a weight-sharing convolutional connection here. 3) Conditional convolution connects input layers at the previous timestep to the current hidden layer, to further enhance visual-temporal dependency modelling. To simultaneously consider the temporal dependency from both previous and future frames, we exploit a forward recurrent network and a backward recurrent network, respectively, and then combine them together for the final prediction. We apply the proposed model to super resolve videos with complex motions. The experimental results demonstrate that the model can achieve state-of-the-art performance, as well as orders of magnitude faster speed than other multi-frame SR methods.\nOur main contributions can be summarized as follows. We propose a bidirectional recurrent convolutional network for multi-frame SR, where the temporal dependency is efficiently modelled by bidirectional recurrent and conditional convolutions. It is an end-to-end framework which does not need pre-/post-processing. We achieve better performance and faster speed than existing multi-frame SR methods.\n\n2 Related Work\n\nWe review the related work from the following perspectives.\nSingle-Image SR. Irani and Peleg [10] propose the seminal work for this problem, followed by Freeman et al. [8], who study it in a learning-based way. To alleviate the high computational complexity, Bevilacqua et al. [4] and Chang et al. [5] introduce manifold learning techniques which reduce the required number of image patch exemplars. For further acceleration, Timofte et al. [23] propose the anchored neighborhood regression method. Yang et al. [25] and Zeyde et al. [26] exploit compressive sensing to encode image patches with a compact dictionary and obtain sparse representations. 
Dong et al. [6] learn a convolutional neural network for single-image SR, which achieves the current state-of-the-art result. In this work, we focus on multi-frame SR by modelling temporal dependency in video sequences.\nMulti-Frame SR. Baker and Kanade [2] extract optical flow to model the temporal dependency in video sequences for video SR. Various improvements around this work [14, 13] are then explored to better handle visual motions. However, these methods suffer from high computational cost due to the motion estimation. To deal with this problem, Protter et al. [16] and Takeda et al. [21] avoid motion estimation by employing nonlocal means and 3D steering kernel regression. In this work, we propose bidirectional recurrent and conditional convolutions as an alternative to model temporal dependency and achieve faster speed.\n\n3 Bidirectional Recurrent Convolutional Network\n\n3.1 Formulation\n\nGiven a low-resolution, noisy and blurry video, our goal is to obtain a high-resolution, noise-free and blur-free version. In this paper, we propose a bidirectional recurrent convolutional network (BRCN) to map the low-resolution frames to high-resolution ones. As shown in Figure 1, the proposed network contains a forward recurrent convolutional sub-network and a backward recurrent convolutional sub-network to model the temporal dependency from both previous and future frames. Note that a similar bidirectional scheme has been proposed previously in [18]. The two sub-networks of BRCN are denoted by the two black blocks with dashed borders, respectively. In each sub-network, there are four layers, including the input layer, the first hidden layer, the second hidden layer and the output layer, which are connected by three convolutional operations:\n\n1. Feedforward Convolution. 
The multi-layer convolutions denoted by black lines learn visual spatial dependency between a low-resolution frame and its high-resolution result. Similar configurations have also been explored previously in [11, 7, 6].\n\n[Figure 1: The proposed bidirectional recurrent convolutional network (BRCN).]\n\n2. Recurrent Convolution. The convolutions denoted by blue lines aim to model long-term temporal dependency across video frames by connecting adjacent hidden layers of successive frames, where the current hidden layer is conditioned on the hidden layer at the previous timestep. We use the recurrent convolution in both forward and backward sub-networks. Such a bidirectional recurrent scheme can make full use of the forward and backward temporal dynamics.\n\n3. Conditional Convolution. The convolutions denoted by red lines connect the input layer at the previous timestep to the current hidden layer, and use previous inputs to provide long-term contextual information. They enhance visual-temporal dependency modelling with this kind of conditional connection.\n\nWe denote the frame set of a low-resolution video X as {X_i}, i = 1, 2, ..., T (see Footnote 1), and infer the other three layers as follows.\nFirst Hidden Layer. 
When inferring the first hidden layer H^f_1(X_i) (or H^b_1(X_i)) at the i-th timestep in the forward (or backward) sub-network, three inputs are considered: 1) the current input layer X_i, connected by a feedforward convolution, 2) the hidden layer H^f_1(X_{i-1}) (or H^b_1(X_{i+1})) at the (i-1)-th (or (i+1)-th) timestep, connected by a recurrent convolution, and 3) the input layer X_{i-1} (or X_{i+1}) at the (i-1)-th (or (i+1)-th) timestep, connected by a conditional convolution:\n\nH^f_1(X_i) = λ(W^f_v1 ∗ X_i + W^f_r1 ∗ H^f_1(X_{i-1}) + W^f_t1 ∗ X_{i-1} + B^f_1)\nH^b_1(X_i) = λ(W^b_v1 ∗ X_i + W^b_r1 ∗ H^b_1(X_{i+1}) + W^b_t1 ∗ X_{i+1} + B^b_1)    (1)\n\nwhere W^f_v1 (or W^b_v1) and W^f_t1 (or W^b_t1) represent the filters of the feedforward and conditional convolutions in the forward (or backward) sub-network, respectively. Both have size c×f_v1×f_v1×n_1, where c is the number of input channels, f_v1 is the filter size and n_1 is the number of filters. W^f_r1 (or W^b_r1) represents the filters of the recurrent convolution; their filter size f_r1 is set to 1 to avoid border effects. B^f_1 (or B^b_1) represents the biases. The activation function is the rectified linear unit (ReLU): λ(x) = max(0, x) [15]. 
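As a rough illustration of Equation 1, the following minimal sketch (our own, not the authors' code) computes one step of the forward sub-network's first hidden layer; a single input channel, a single filter, and "same"-padded numpy cross-correlation are all simplifying assumptions here.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padded 2D cross-correlation: x is (H, W), w is (k, k), k odd."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def relu(x):
    # lambda(x) = max(0, x), the activation used in Equation 1
    return np.maximum(0.0, x)

def first_hidden_forward(x_cur, x_prev, h_prev, Wv1, Wr1, Wt1, b1):
    """One step of H^f_1(X_i): feedforward conv on the current frame X_i,
    a 1x1 recurrent conv (a scalar in this single-filter sketch) on
    H^f_1(X_{i-1}), and a conditional conv on the previous frame X_{i-1}."""
    return relu(conv2d_same(x_cur, Wv1)      # W^f_v1 * X_i
                + Wr1 * h_prev               # W^f_r1 * H^f_1(X_{i-1}), filter size 1
                + conv2d_same(x_prev, Wt1)   # W^f_t1 * X_{i-1}
                + b1)                        # B^f_1
```

With an identity kernel, zero recurrent state and zero bias, the update reduces to relu(2·X_i), which is an easy sanity check of the three-term sum.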
Note that in Equation 1, the filter responses of recurrent and conditional convolutions can be regarded as dynamically changing biases, which focus on modelling the temporal changes across frames, while the filter responses of feedforward convolution focus on learning visual content.\n(Footnote 1: Note that we upscale each low-resolution frame in the sequence to the desired size with bicubic interpolation in advance.)\n\n[Figure 2: Comparison between TRBM and the proposed BRCN. (a) TRBM; (b) BRCN.]\n\nSecond Hidden Layer. This phase projects the obtained feature maps H^f_1(X_i) (or H^b_1(X_i)) from n_1 to n_2 dimensions, which aims to capture the nonlinear structure in the sequence data. In addition to intra-frame mapping by feedforward convolution, we also consider two inter-frame mappings using recurrent and conditional convolutions, respectively. 
The projected n_2-dimensional feature maps in the second hidden layer H^f_2(X_i) (or H^b_2(X_i)) in the forward (or backward) sub-network can be obtained as follows:\n\nH^f_2(X_i) = λ(W^f_v2 ∗ H^f_1(X_i) + W^f_r2 ∗ H^f_2(X_{i-1}) + W^f_t2 ∗ H^f_1(X_{i-1}) + B^f_2)\nH^b_2(X_i) = λ(W^b_v2 ∗ H^b_1(X_i) + W^b_r2 ∗ H^b_2(X_{i+1}) + W^b_t2 ∗ H^b_1(X_{i+1}) + B^b_2)    (2)\n\nwhere W^f_v2 (or W^b_v2) and W^f_t2 (or W^b_t2) represent the filters of the feedforward and conditional convolutions, respectively, both of which have size n_1×1×1×n_2. W^f_r2 (or W^b_r2) represents the filters of the recurrent convolution, whose size is n_2×1×1×n_2.\nNote that the inference of the two hidden layers can be regarded as a representation learning phase, where we could stack more hidden layers to increase the representability of our network to better capture the complex data structure.\nOutput Layer. In this phase, we combine the projected n_2-dimensional feature maps in both forward and backward sub-networks to jointly predict the desired high-resolution frame:\n\nO(X_i) = W^f_v3 ∗ H^f_2(X_i) + W^f_t3 ∗ H^f_2(X_{i-1}) + B^f_3 + W^b_v3 ∗ H^b_2(X_i) + W^b_t3 ∗ H^b_2(X_{i+1}) + B^b_3    (3)\n\nwhere W^f_v3 (or W^b_v3) and W^f_t3 (or W^b_t3) represent the filters of the feedforward and conditional convolutions, respectively. Their sizes are both n_2×f_v3×f_v3×c. 
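To make the output step concrete, here is a hypothetical numpy sketch of Equation 3; single-channel maps and a "same"-padded naive convolution are our assumptions (the real model sums over n_2 feature maps and all filter channels).

```python
import numpy as np

def conv_same(x, w):
    """Naive 'same'-padded 2D cross-correlation (single channel)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    return np.array([[np.sum(xp[i:i + k, j:j + k] * w)
                      for j in range(x.shape[1])] for i in range(x.shape[0])])

def output_layer(hf_cur, hf_prev, hb_cur, hb_next,
                 Wfv3, Wft3, Wbv3, Wbt3, bf3, bb3):
    """Equation 3: O(X_i) sums feedforward and conditional convolutions
    from both sub-networks; no recurrent convolution at the output."""
    return (conv_same(hf_cur, Wfv3)     # W^f_v3 * H^f_2(X_i)
            + conv_same(hf_prev, Wft3)  # W^f_t3 * H^f_2(X_{i-1})
            + bf3                       # B^f_3
            + conv_same(hb_cur, Wbv3)   # W^b_v3 * H^b_2(X_i)
            + conv_same(hb_next, Wbt3)  # W^b_t3 * H^b_2(X_{i+1})
            + bb3)                      # B^b_3
```

Note that the backward branch conditions on the *next* frame's maps, mirroring the forward branch's use of the previous frame.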
We do not use any recurrent convolution\nfor output layer.\n\nt3 \u2217 Hf\nt3 (or Wb\n\nv3 (or Wb\n\nv3) and Wf\n\n2 (Xi\u22121) + Bf\n\n3 + Wb\nv3\n\n2(Xi) + Wb\nt3\n\n2 (Xi) + Wf\n\n\u2217 Hf\n\n\u2217 Hb\n\n3.2 Connection with Temporal Restricted Boltzmann Machine\n\nIn this section, we discuss the connection between the proposed BRCN and temporal restricted\nboltzmann machine (TRBM) [20] which is a widely used model in sequence modelling.\nAs shown in Figure 2, TRBM and BRCN contain similar recurrent connections (blue lines) between\nhidden layers, and conditional connections (red lines) between input layer and hidden layer. They\nshare the common \ufb02exibility to model and propagate temporal dependency along the time. How-\never, TRBM is a generative model while BRCN is a discriminative model, and TRBM contains an\nadditional connection (green line) between input layers for sample generation.\nIn fact, BRCN can be regarded as a deterministic, bidirectional and patch-based implementation of\nTRBM. Speci\ufb01cally, when inferring the hidden layer in BRCN, as illustrated in Figure 2 (b), feed-\nforward and conditional convolutions extract overlapped patches from the input, each of which is\n\n4\n\n1B1A1C0CiX1i\uf02dX1i\uf02dHiH1i\uf02dXiX11()fi\uf02dHX1()fiHX-dimensional vector\ffully connected to a n1-dimensional vector in the feature maps Hf\n1 (Xi). For recurrent convolution-\ns, since each \ufb01lter size is 1 and all the \ufb01lters contain n1\u00d7n1 weights, a n1-dimensional vector in\nHf\n1 (Xi) is fully connected to the corresponding n1-dimensional vector in Hf\n1 (Xi\u22121) at the previ-\nous time step. Therefore, the patch connections of BRCN are actually those of a \u201cdiscriminative\u201d\nTRBM. 
In other words, by setting the filter sizes of the feedforward and conditional convolutions to the size of the whole frame, BRCN becomes equivalent to TRBM.\nCompared with TRBM, BRCN has the following advantages for handling video super-resolution. 1) BRCN restricts the receptive field of the original full connection to a patch rather than the whole frame, which can capture the temporal change of visual details. 2) BRCN replaces all the full connections with weight-sharing convolutional ones, which largely reduces the computational cost. 3) BRCN is more flexible in handling videos of different sizes once it is trained on a fixed-size video dataset. Similar to TRBM, the proposed model can be generalized to other sequence modelling applications, e.g., video motion modelling [22].\n\n3.3 Network Learning\n\nBy combining Equations 1, 2 and 3, we obtain the desired prediction O(X; Θ) from the low-resolution video X, where Θ denotes the network parameters. Network learning proceeds by minimizing the mean squared error (MSE) between the predicted high-resolution video O(X; Θ) and the ground truth Y:\n\nL = ||O(X; Θ) − Y||^2    (4)\n\nvia stochastic gradient descent, which is enough to achieve satisfying results, although we could exploit other optimization algorithms with higher computational cost, e.g., L-BFGS. During optimization, all the filter weights of recurrent and conditional convolutions are initialized by randomly sampling from a Gaussian distribution with mean 0 and standard deviation 0.001, whereas the filter weights of feedforward convolution are pre-trained on static images [6]. Note that the pre-training step only aims to speed up training by providing a better parameter initialization, given the limited size of the training set; it could be avoided by using a larger-scale dataset instead. 
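The learning setup above can be sketched as follows; this is illustrative only: `squared_error` follows Equation 4 and the Gaussian initialization mirrors the text, but the update is a bare SGD step over a hypothetical parameter dict, not the authors' training code.

```python
import numpy as np

def squared_error(pred, target):
    """Equation 4: L = ||O(X; Theta) - Y||^2 (sum of squared differences)."""
    return np.sum((pred - target) ** 2)

def sgd_step(theta, grad, lr=1e-4):
    """One plain stochastic-gradient-descent update over a dict of weights;
    the text notes a smaller learning rate (e.g., 1e-4) for the output layer."""
    return {name: w - lr * grad[name] for name, w in theta.items()}

def init_recurrent_filters(shape, std=0.001, seed=0):
    """Recurrent/conditional filters: Gaussian init with mean 0, std 0.001."""
    return np.random.default_rng(seed).normal(0.0, std, size=shape)
```

In practice one would compute `grad` by backpropagation through the three convolutions; that machinery is omitted here.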
We experimentally find that using a smaller learning rate (e.g., 1e-4) for the weights in the output layer is crucial to obtain good performance.\n\n4 Experimental Results\n\nTo verify its effectiveness, we apply the proposed model to the task of video SR, and present both quantitative and qualitative results as follows.\n\n4.1 Datasets and Implementation Details\n\nWe use 25 YUV-format video sequences (see Footnote 2) as our training set; they have been widely used in many video SR methods [13, 16, 21]. To enlarge the training set, model training is performed in a volume-based way, i.e., cropping multiple overlapped volumes from the training videos and regarding each volume as a training sample. During cropping, each volume has a spatial size of 32×32 and a temporal step of 10, with spatial and temporal strides of 14 and 8, respectively. As a result, we can generate roughly 41,000 volumes from the original dataset. We test our model on a variety of challenging videos, including Dancing, Flag, Fan, Treadmill and Turbine [19], which contain complex motions with severe motion blur and aliasing. Note that we do not have to extract volumes during testing, since the convolutional operations can scale to videos of any spatial size and temporal step. 
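The volume-based training-set construction can be sketched like this (a hypothetical helper; the defaults match the 32×32×10 volumes and 14/8 strides described above, while the clip size in the usage note is made up):

```python
import numpy as np

def crop_volumes(video, size=32, depth=10, s_stride=14, t_stride=8):
    """Crop overlapped training volumes from a (T, H, W) video:
    32x32 spatial patches over 10 consecutive frames, slid with
    spatial stride 14 and temporal stride 8."""
    T, H, W = video.shape
    vols = []
    for t in range(0, T - depth + 1, t_stride):
        for y in range(0, H - size + 1, s_stride):
            for x in range(0, W - size + 1, s_stride):
                vols.append(video[t:t + depth, y:y + size, x:x + size])
    return vols
```

For example, a hypothetical 26-frame 60×60 clip yields 3 temporal positions and 3×3 spatial positions, i.e., 27 volumes.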
We generate the testing dataset with the following steps: 1) using a Gaussian filter with standard deviation 2 to smooth each original frame, and 2) downsampling each frame by a factor of 4 with the bicubic method (see Footnote 3).\n(Footnote 2: http://www.codersvoice.com/a/webbase/video/08/152014/130.html.)\n(Footnote 3: Here we focus on the factor of 4, which is usually considered the most difficult case in super-resolution.)\n\nTable 1: The results of PSNR (dB) and running time (sec) on the testing video sequences.\n\nVideo     | Bicubic      | SC [25]       | K-SVD [26]   | NE+NNLS [4]   | ANR [23]\n          | PSNR   Time  | PSNR   Time   | PSNR   Time  | PSNR   Time   | PSNR   Time\nDancing   | 26.83  -     | 26.80  45.47  | 27.69  2.35  | 27.63  19.89  | 27.67  0.85\nFlag      | 26.35  -     | 26.28  12.89  | 27.61  0.58  | 27.41  4.54   | 27.52  0.20\nFan       | 31.94  -     | 32.50  12.92  | 33.55  1.06  | 33.45  8.27   | 33.49  0.38\nTreadmill | 21.15  -     | 21.27  15.47  | 22.22  0.35  | 22.08  2.60   | 22.24  0.12\nTurbine   | 25.09  -     | 25.77  16.49  | 27.00  0.51  | 26.88  3.67   | 27.04  0.18\nAverage   | 26.27  -     | 26.52  20.64  | 27.61  0.97  | 27.49  7.79   | 27.59  0.35\n\nVideo     | NE+LLE [5]   | SR-CNN [6]   | 3DSKR [21]   | Enhancer [1] | BRCN\n          | PSNR   Time  | PSNR   Time  | PSNR   Time  | PSNR   Time  | PSNR   Time\nDancing   | 27.64  4.20  | 27.81  1.41  | 27.81  1211  | 27.06  -     | 28.09  3.44\nFlag      | 27.48  0.96  | 28.04  0.36  | 26.89  255   | 26.58  -     | 28.55  0.78\nFan       | 33.46  1.76  | 33.61  0.60  | 31.91  323   | 32.14  -     | 33.73  1.46\nTreadmill | 22.22  0.57  | 22.42  0.15  | 22.32  127   | 21.20  -     | 22.63  0.46\nTurbine   | 26.98  0.80  | 27.50  0.23  | 24.27  173   | 25.60  -     | 27.71  0.70\nAverage   | 27.52  1.66  | 27.87  0.55  | 26.64  418   | 26.52  -     | 28.15  1.36\n\nTable 2: The results of PSNR (dB) by variants of BRCN on the testing video sequences. 
v: feedforward convolution, r: recurrent convolution, t: conditional convolution, b: bidirectional scheme.\n\nVideo     | BRCN {v} | BRCN {v, r} | BRCN {v, t} | BRCN {v, r, t} | BRCN {v, r, t, b}\nDancing   | 27.81    | 27.98       | 27.99       | 28.09          | 28.09\nFlag      | 28.04    | 28.32       | 28.39       | 28.47          | 28.55\nFan       | 33.61    | 33.63       | 33.65       | 33.65          | 33.73\nTreadmill | 22.42    | 22.59       | 22.56       | 22.59          | 22.63\nTurbine   | 27.50    | 27.47       | 27.50       | 27.62          | 27.71\nAverage   | 27.87    | 27.99       | 28.02       | 28.09          | 28.15\n\nSome important parameters of our network are as follows: f_v1 = 9, f_v3 = 5, n_1 = 64, n_2 = 32 and c = 1 (see Footnote 4). Note that varying the number and size of filters does not have a significant impact on performance, because filters of certain sizes are already in a regime where they can almost reconstruct the high-resolution videos [24, 6].\n\n4.2 Quantitative and Qualitative Comparison\n\nWe compare our BRCN with two multi-frame SR methods, 3DSKR [21] and the commercial software Enhancer [1], and seven single-image SR methods: Bicubic, SC [25], K-SVD [26], NE+NNLS [4], ANR [23], NE+LLE [5] and SR-CNN [6].\nThe results of all the methods are compared in Table 1, where the evaluation measures include both peak signal-to-noise ratio (PSNR) and running time (Time). Specifically, compared with the state-of-the-art single-image SR methods (e.g., SR-CNN, ANR and K-SVD), our multi-frame method surpasses them by 0.28~0.54 dB, which is mainly attributed to the beneficial mechanism of temporal dependency modelling. BRCN also performs much better than the two representative multi-frame SR methods (3DSKR and Enhancer), by 1.51 dB and 1.63 dB, respectively. 
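For reference, the PSNR reported in Tables 1 and 2 can be computed with the standard definition below; the peak value of 255 for 8-bit luminance is our assumption, as the paper does not spell out its exact PSNR routine.

```python
import numpy as np

def psnr(pred, gt, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` = 255 assumes 8-bit frames."""
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```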
In fact, most existing multi-frame methods tend to fail catastrophically when dealing with very complex motions, because it is difficult for them to estimate the motions with pinpoint accuracy.\nFor the proposed BRCN, we also investigate the impact of the model architecture on performance. We take a simplified network containing only the feedforward convolution as a baseline, and then study several variants by successively adding the other operations, including the bidirectional scheme and the recurrent and conditional convolutions. The results of all the variants of BRCN are shown in Table 2, where the elements in the braces represent the included operations. As we can see, owing to the benefit of learning temporal dependency, exploiting either the recurrent convolution {v, r} or the conditional convolution {v, t} greatly improves the performance. Combining the two convolutions {v, r, t} obtains still better results, and the performance is further improved by adding the bidirectional scheme {v, r, t, b}, which reflects the fact that each video frame is related not only to its previous frame but also to the future one.\nIn addition to the quantitative evaluation, we also present some qualitative results in terms of single-frame (Figure 3) and multi-frame (Figure 5) comparisons. Please enlarge and view these figures on the screen for better comparison.\n(Footnote 4: Similar to [23], we only deal with the luminance channel in the YCrCb color space. Note that our model can be generalized to handle all three channels by setting c = 3. Here we simply upscale the other two channels with the bicubic method for illustration.)\n\n[Figure 3: Closeup comparison among original frames and super-resolved results by Bicubic, ANR [23], SR-CNN [6] and BRCN, respectively.]\n
From these figures, we can observe that our method is able to recover more image details than the others under various motion conditions.\n\n4.3 Running Time\n\nWe present the comparison of running time in both Table 1 and Figure 4, where all the methods are run on the same machine (Intel CPU 3.10 GHz and 32 GB memory). The publicly available codes of the compared methods are all in MATLAB, while SR-CNN and ours are in Python. From the table and figure, we can see that our BRCN takes 1.36 sec per frame on average, which is orders of magnitude faster than the fast multi-frame SR method 3DSKR. It should be noted that the speed gap is not caused by the different MATLAB/Python implementations. As stated in [13, 21], the computational bottleneck for existing multi-frame SR methods is the accurate motion estimation, while our model explores an alternative based on efficient spatial-temporal convolutions, which has lower computational complexity. Our method is slower than the fastest single-image SR method, ANR, likely because it involves the additional phase of temporal dependency modelling, but we achieve better performance (28.15 vs. 27.59 dB).\n\n[Figure 4: Running time vs. PSNR for all the methods (multi-frame vs. single-image SR methods).]\n\n[Figure 5: Comparison among original frames (the 2nd, 3rd and 4th frames, from the top row to the bottom) of the Dancing video and super-resolved results by Bicubic, ANR [23], SR-CNN [6] and BRCN, respectively.]\n\n4.4 Filter Visualization\n\n[Figure 6: Visualization of learned filters by the proposed BRCN: (a) W^f_v1, (b) W^f_t1, (c) W^f_v3, (d) W^f_t3.]\n\nWe visualize the learned filters of feedforward and conditional convolutions in Figure 6. 
The filters of W^f_v1 and W^f_t1 exhibit some strip-like patterns, which can be viewed as edge detectors. The filters of W^f_v3 and W^f_t3 show some centrally-averaging patterns, which indicates that the predicted high-resolution frame is obtained by averaging over the feature maps in the second hidden layer. This averaging operation is also consistent with the corresponding reconstruction phase in patch-based SR methods (e.g., [25]), but the difference is that our filters are automatically learned rather than pre-defined. Comparing the learned filters of the feedforward and conditional convolutions, we also observe that the patterns in the feedforward filters are much more regular and clear.\n\n5 Conclusion and Future Work\n\nIn this paper, we have proposed the bidirectional recurrent convolutional network (BRCN) for multi-frame SR. Our main contribution is the novel use of the bidirectional scheme together with recurrent and conditional convolutions for temporal dependency modelling. We have applied our model to super resolve videos containing complex motions, and achieved better performance and faster speed than existing methods. In the future, we will perform comparisons with other multi-frame SR methods.\n\nAcknowledgments\n\nThis work is jointly supported by National Natural Science Foundation of China (61420106015, 61175003, 61202328, 61572504) and National Basic Research Program of China (2012CB316300).\n\nReferences\n\n[1] Video Enhancer. http://www.infognition.com/videoenhancer/, version 1.9.10, 2014.\n[2] S. Baker and T. Kanade. Super-resolution optical flow. Technical report, CMU, 1999.\n[3] B. Bascle, A. Blake, and A. Zisserman. Motion deblurring and super-resolution from an image sequence. European Conference on Computer Vision, pages 571–582, 1996.\n[4] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 
British Machine Vision Conference, 2012.\n[5] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. IEEE Conference on Computer Vision and Pattern Recognition, 2004.\n[6] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. European Conference on Computer Vision, pages 184–199, 2014.\n[7] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. IEEE International Conference on Computer Vision, pages 633–640, 2013.\n[8] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. International Journal of Computer Vision, pages 25–47, 2000.\n[9] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. IEEE International Conference on Computer Vision, pages 349–356, 2009.\n[10] M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical Models and Image Processing, pages 231–239, 1991.\n[11] V. Jain and S. Seung. Natural image denoising with convolutional networks. Advances in Neural Information Processing Systems, pages 769–776, 2008.\n[12] K. Jia, X. Wang, and X. Tang. Image transformation based on learning dictionaries across image spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 367–380, 2013.\n[13] C. Liu and D. Sun. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 346–360, 2014.\n[14] D. Mitzel, T. Pock, T. Schoenemann, and D. Cremers. Video super resolution using duality based TV-L1 optical flow. Pattern Recognition, pages 432–441, 2009.\n[15] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning, pages 807–814, 2010.\n[16] M. Protter, M. Elad, H. Takeda, and P. Milanfar. 
Generalizing the nonlocal-means to super-resolution reconstruction. IEEE Transactions on Image Processing, pages 36–51, 2009.\n[17] R. R. Schultz and R. L. Stevenson. Extraction of high-resolution frames from video sequences. IEEE Transactions on Image Processing, pages 996–1011, 1996.\n[18] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, pages 2673–2681, 1997.\n[19] O. Shahar, A. Faktor, and M. Irani. Space-time super-resolution from a single video. IEEE Conference on Computer Vision and Pattern Recognition, pages 3353–3360, 2011.\n[20] I. Sutskever and G. E. Hinton. Learning multilevel distributed representations for high-dimensional sequences. International Conference on Artificial Intelligence and Statistics, pages 548–555, 2007.\n[21] H. Takeda, P. Milanfar, M. Protter, and M. Elad. Super-resolution without explicit subpixel motion estimation. IEEE Transactions on Image Processing, pages 1958–1975, 2009.\n[22] G. Taylor, G. Hinton, and S. Roweis. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems, pages 448–455, 2006.\n[23] R. Timofte, V. De, and L. V. Gool. Anchored neighborhood regression for fast example-based super-resolution. IEEE International Conference on Computer Vision, pages 1920–1927, 2013.\n[24] L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. Advances in Neural Information Processing Systems, pages 1790–1798, 2014.\n[25] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, pages 2861–2873, 2010.\n[26] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. 
Curves and Surfaces, pages 711–730, 2012.\n", "award": [], "sourceid": 116, "authors": [{"given_name": "Yan", "family_name": "Huang", "institution": "CRIPAC, CASIA"}, {"given_name": "Wei", "family_name": "Wang", "institution": "NLPR, CASIA"}, {"given_name": "Liang", "family_name": "Wang", "institution": null}]}