Paper ID: 116
Title: Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
The paper uses a bidirectional, convolutional RNN to do multi-frame super resolution. They compare their method to a number of single frame and multi-frame methods.

This appears to be an extension of the model proposed by Dong et al 2014 to the temporal domain. The convolutional network is extended into the temporal domain by having featuremaps which are dependent in time alongside the featuremaps found in a standard feedforward convolutional network. The recurrent convolution is novel. The model is fairly simple and elegant to implement.

The bidirectional aspect is not novel and goes back at least as far as 1997 (Schuster and Paliwal). Bidirectional networks are also extremely common in the NLP and speech recognition literature.

The authors should clarify whether pretraining the feedforward weights is strictly necessary for the model to work as intended. If it is, they should describe the pretraining process in more detail (whether the weights were trained on static images or on video frame by frame).

That the recurrent/conditional convolutions provide the most substantial gains is a nice result. But the marginal improvement provided by both connections combined (v,r,t) seems to imply that one becomes redundant in the presence of the other.

The method seems to work better than other previous methods, although it is hard to evaluate by how much.

Other comments:
-- No citations for bidirectional RNNs (see above).
-- The filter visualization section is not very illuminating.
-- The paper could offer better motivation for the architecture: why would one expect this to work better than existing methods?
-- It is not clear how they deal with edge effects. E.g., using a convolution reduces the size of the image. Going forward in time, they use filters of size 1, so the size doesn't change, but when going deeper into the network, the filter size is 9, so the resulting feature map is smaller than the original image. Not sure how they deal with this.
-- Figure 3: in the image with the flag and power lines, the region with the power lines shows some ringing artifacts. Any comments?

Overall, I like the paper. The recurrent convolution is a very good/obvious idea.

Q2: Please summarize your review in 1-2 sentences
The use of a convolutional RNN is a good idea and a novel contribution to this problem.

However, the paper could do more to explain why the method actually works well compared to existing methods.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
The paper proposes a recurrent CNN architecture for multi-frame super resolution. It employs three kinds of convolutional filters: (i) feed-forward; (ii) recurrent; (iii) conditional convolutions. Overall the architecture is novel but seems to be a straightforward combination of existing deep learning modules.

Technical quality of paper is borderline. It is a straightforward combination of existing deep learning modules. However, at the same time novelty lies in the fact that it has not been done before for multi-frame SR.

In the rebuttal, the authors should address the following questions: (i) How many parameters does the model have? (ii) Did they compare with the pre-trained SR-CNN, or did they re-train SR-CNN on the new data set? (iii) Why is the run-time of SR-CNN slower than that of BRCN? (iv) In Fig 3 & 5, what do the images for SR-CNN look like?

Originality of the paper is incremental. In table 2 the authors show how individual parts of the architecture contribute, though BRCN {v,r} alone already outperforms the existing methods. It is well accepted by now that making an architecture more complex and increasing the number of parameters helps improve accuracy. This is evident from table 2, where BRCN {v,t}, {v,r,t} & {v,r,t,b} give incremental improvements in performance.

Significance of the paper for NIPS audience is open for discussion. The architecture introduced in the paper is novel but it is incremental to the field of deep learning.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a novel architecture for multi-frame SR. It is easy to follow and builds upon previous work on single-frame CNNs for SR [6]. Some experimental details are missing, which can be fixed. Overall the paper is incremental but thorough. Its suitability for the NIPS audience is open for discussion.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see

This paper proposes a bidirectional recurrent convolutional network for multi-frame super resolution. In a sense the paper formulates the problem in a way that I think most people familiar with RNNs would consider straightforward and sensible. The approach appears novel. The Dong et al. ECCV 14 work on learning deep convolutional networks for image super resolution appears to indeed be the most relevant recent work, but it was applied to static frames. This work seems like a very natural extension in the general trajectory of deep convnets for super resolution. This, combined with the fact that the work provides an evaluation of reasonable quality, makes the paper acceptable for NIPS in my view.

The most interesting aspect of this work is perhaps the exploration in Table 2 where the effect of feedforward convolutions, recurrent convolutions and conditional convolutions are evaluated.

The statements about MATLAB vs Python when considering different explanations for the differences in run times seem like a side issue compared to the more important question of GPU acceleration. Modern convolutional neural networks are almost always implemented using GPUs in state-of-the-art systems. This paper states on line 362 that the implementation of the approach presented here is in Python, but there is no discussion either way concerning the use of GPU acceleration.

Please clarify this issue. If a GPU was not used, this method could be dramatically faster. If a GPU was used, then the comparison with prior work really needs to be cast in that light. Many prior methods are likely amenable to GPU acceleration as well.

Accurate motion estimation in particular is given as the traditional bottleneck for non-RNN based methods. Such techniques could potentially be quite effectively accelerated with GPU methods.

Language Issues:


* Considering that recurrent neural network[s] (RNNs) can...
* Different from vanilla RNN[s]...
* conditional convolutional connections from previous input layers to [the] current hidden layer are added for enhancing visual-temporal dependency mod[el].

Please proofread the body text for other language issues.

Conclusions: In the future, we will [perform] comparison[s] with [other] multi-frame SR methods

Q2: Please summarize your review in 1-2 sentences
This paper presents both some strong quantitative results as well as clear visual results illustrating the effectiveness of a bidirectional recurrent convolutional network approach for multi-frame super resolution. The paper has a few language issues (which need to be resolved by the authors for the paper to be acceptable), but the model explored is very sensible and the results have a good mix of quantitative performance increases, increases to visual quality and compelling computation times relative to prior art.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
[this is a light review] I would encourage the authors to discuss in more detail how their approach compares to other approaches w.r.t. quantitative and qualitative evaluation of between frame quality (i.e. consistency/flickering etc.).

== post rebuttal == According to the rebuttal, a positive point is that the authors plan to release their model. Additionally, it would be helpful if they released their code to allow repeatability.
Q2: Please summarize your review in 1-2 sentences
The paper presents a novel and efficient (w.r.t. runtime) approach, which is evaluated against a large body of related work as well as ablations. Given that the task is video, it would be good to extend the qualitative evaluation of multi-frame in Fig 5. with a discussion and a quantitative comparison (e.g. consistency measure).

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
Line 81: Maybe explain the current motion-based methods better for readers who are not familiar with them.
- Is this the first work that uses convolution for the recurrent transition for any kind of problem? If not, please mention it.
- The name "conditional convolution" is misleading and inaccurate, as there is no conditioning; it is just a convolution from the previous time step's input.
Line 180: Explaining the second hidden layer in detail is redundant.

Section 3.2: Why is the model suddenly compared to TRBM, out of nowhere, in a whole section? If TRBM is an important related work, it should have been mentioned earlier.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a new sequence-based video super resolution model. The model replaces the fully connected recurrent transition with a convolutional recurrent transition and adds an extra convolutional transition from the inputs of the previous time step. The authors achieve better results at lower computational cost. It is an interesting computer vision application paper with good results, and I enjoyed reading it. But I am not sure about its impact on the NIPS community, and it might be better received at a computer vision conference.
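
The structure summarized above can be illustrated with a minimal NumPy sketch of one recurrent convolutional hidden layer. This is not the authors' implementation: it uses a single feature map, the names `conv2d_same` and `recurrent_conv_layer` are hypothetical, and the 9x9 size of the conditional filter is an assumption (the reviews only mention a 9x9 feedforward filter and a size-1 recurrent filter).

```python
import numpy as np

def conv2d_same(x, k):
    """Zero-padded 2-D cross-correlation; output has the shape of x (odd kernel sizes)."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))  # constant (zero) padding by default
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def recurrent_conv_layer(frames, w_v, w_r, w_t, bias=0.0):
    """h_t = relu(w_v * x_t + w_r * h_{t-1} + w_t * x_{t-1} + bias), '*' = convolution."""
    h_prev = np.zeros(frames[0].shape)  # no hidden state before the first frame
    x_prev = np.zeros(frames[0].shape)  # no previous input before the first frame
    hidden = []
    for x in frames:
        pre = (conv2d_same(x, w_v)         # feedforward convolution
               + conv2d_same(h_prev, w_r)  # recurrent convolution
               + conv2d_same(x_prev, w_t)  # "conditional" convolution
               + bias)
        h_prev = np.maximum(pre, 0.0)      # ReLU
        x_prev = x
        hidden.append(h_prev)
    return hidden

rng = np.random.default_rng(0)
frames = [rng.random((16, 16)) for _ in range(4)]  # hypothetical toy video
w_v = 0.01 * rng.standard_normal((9, 9))  # feedforward filter
w_r = 0.01 * rng.standard_normal((1, 1))  # recurrent filter of size 1
w_t = 0.01 * rng.standard_normal((9, 9))  # conditional filter (size assumed)

forward = recurrent_conv_layer(frames, w_v, w_r, w_t)
# The backward direction runs over the reversed sequence; a real bidirectional
# model would use a separate set of weights, reused here only for brevity.
backward = recurrent_conv_layer(frames[::-1], w_v, w_r, w_t)[::-1]
```

Setting `w_r` and `w_t` to zero reduces the layer to a per-frame feedforward convolution, which is the single-frame setting the reviews compare against.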

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their insightful comments.

To R_1

Q: Is pretraining necessary?
A: It is unnecessary given enough training time and data. However, to speed up training, we use pretraining (on static images [6]) to initialize our feedforward networks.

Q: The filter visualization is not very illuminating
A: We have explained the learned filters in Lines 412-421. The filter patterns are consistent with specific operations in previous methods (e.g. [24]), which is very meaningful.

Q: Motivation for the architecture
A: Given the great success of CNNs in the spatial domain and of RNNs in long-term sequence modelling, we integrate their merits and propose bidirectional recurrent convolutional networks (BRCNs) to model the spatial (image frame) and temporal (motion) dependencies in video super-resolution simultaneously. BRCNs have great potential to surpass existing methods.

Q: Edge effects
A: Similar to [6], we can perform zero-padding to guarantee that the feature map has the same size as the input.
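
The effect of zero-padding on feature-map size can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions, not the authors' code: the helper `conv2d` and the 32x32 frame are hypothetical, and the 9x9 filter size is the one mentioned by R_1.

```python
import numpy as np

def conv2d(image, kernel, pad=0):
    """Naive 2-D cross-correlation with optional symmetric zero-padding."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant", constant_values=0.0)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
frame = rng.random((32, 32))   # hypothetical input frame
kernel = rng.random((9, 9))    # 9x9 filter

valid = conv2d(frame, kernel)         # no padding: each side shrinks by 9 - 1 = 8
same = conv2d(frame, kernel, pad=4)   # pad = (9 - 1) // 2 keeps the input size
print(valid.shape, same.shape)        # (24, 24) (32, 32)
```

The interior of the padded result matches the unpadded one; only the border pixels are influenced by the added zeros.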

Q: Ringing artifacts
A: The ringing artifacts may result from the bicubic-interpolated input, which itself introduces these artifacts. In the future, we will consider feeding low-resolution data to the networks directly, instead of bicubic interpolation.

To R_2

Q: How many model parameters?
A: There are 8129/5216/8128 parameters in feedforward/recurrent/conditional convolutions in each direction. For the bidirectional architecture, there are (8129+5216+8128)*2=42946 parameters in total.
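
The arithmetic in this answer can be checked with a few lines of Python. The per-direction counts are taken from the rebuttal. The layer configuration used in the second half (9x9, 1x1 and 5x5 filters, 64 and 32 feature maps, one input and output channel, biases counted) is an assumption borrowed from the usual SR-CNN setup, not something the rebuttal states, and `conv_params` is a hypothetical helper.

```python
# Per-direction parameter counts as stated in the rebuttal.
feedforward, recurrent, conditional = 8129, 5216, 8128

per_direction = feedforward + recurrent + conditional
total = 2 * per_direction  # forward and backward sub-networks
print(per_direction, total)  # 21473 42946

def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k convolution from c_in to c_out feature maps."""
    return k * k * c_in * c_out + c_out

# ASSUMPTION: three feedforward layers, 9x9 -> 1x1 -> 5x5, with 64 and 32 maps.
assumed_feedforward = (conv_params(9, 1, 64)
                       + conv_params(1, 64, 32)
                       + conv_params(5, 32, 1))
print(assumed_feedforward)  # 8129
```

Under that assumed configuration the feedforward count agrees with the figure given in the rebuttal.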

Q: Compare with pre-trained or re-trained SR-CNN?
A: We retrained SR-CNN on the new dataset, and compared with both pretrained and retrained SR-CNN. The two SR-CNNs obtain similar results.

Q: Why is SR-CNN slower than BRCN?
A: We adopt the released code of SR-CNN [6] in the experiments. As the authors declared, "this code is not optimized and the speed is not representative". Theoretically, SR-CNN should be faster than BRCN.

Q: In Fig 3 & 5, how about the results of SR-CNN?
A: SR-CNN recovers fewer image details than our model and produces visual results similar to those of ANR [22]. We will add the results in the final version.

Q: Its suitability to NIPS
A: In addition to matching the NIPS topics of video processing/deep learning, this work is suitable for publication at NIPS for the following reasons. 1) Learning to super-resolve video is an important problem which requires modelling spatial and temporal dependencies simultaneously. We propose an end-to-end solution for it. 2) As R_4 said, "this paper reports the first attempt to apply deep learning to multi-frame super-resolution". 3) This work provides a good example to the fields of computer vision and deep learning by integrating the merits of CNNs and RNNs to solve spatial-temporal modelling problems. 4) The better performance validates the proposed bidirectional architecture, which provides a successful practice for modelling temporal sequences with RNNs.

To R_3

Q: Use GPU acceleration or not?
A: We train the networks on an Nvidia K20 GPU, but all reported experimental results were obtained by testing on the CPU ONLY. When testing on the GPU, the average time drops from 0.61 to 0.02 seconds.

To R_4

Q: Compare with only two multi-frame methods, 3DSKR is relatively old, recent [13] is missing
A: 3DSKR and Enhancer are two state-of-the-art multi-frame methods which are often used as baselines, e.g. in [13]. We should have compared with [13] in this paper, but did not because no code is publicly available. We will include this comparison through a re-implementation in the future.

Q: Why does [6] have lower PSNR than bicubic?
A: Comparison results between SR-CNN [6] and bicubic on video sequences have not been reported before. The slightly lower PSNR is not surprising, as [6] may not handle well some regions with motion blur, which never occurs in its still-image training dataset.

To R_6

Q: It might be better received at computer vision conference
A: We have emphasized its suitability to NIPS in the responses to R_2. Moreover, visual processing is an important topic in NIPS, e.g. Xu et al., NIPS2014, and Charles, NIPS2014 for image deconvolution and denoising, respectively.

Q: Is it the first work that uses convolution for the recurrent transition?
A: To the best of our knowledge, yes.

Q: Why suddenly compare with TRBM in a whole section?
A: TRBM is a widely used model for sequence modelling. Through comparing with TRBM in terms of structure and characteristics, we can further discover the advantages of the proposed BRCN in video modelling.

We thank the reviewers for pointing out the missing references, language errors and typos. We will carefully correct them in the final version. We will also follow the suggestions of R_6 to write the Related Work and the Formulation more clearly, and the suggestions of R_5 to discuss evaluation methods in more detail. In addition, we will make the pretrained networks available.