NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1235
Title: Review Networks for Caption Generation

Reviewer 1

Summary

The paper proposes an extension to popular encoder-decoder approaches, e.g. for image captioning. Specifically, the idea is to include an intermediate "reviewer" module which predicts a set of fact vectors that could, for example, represent the most important concepts in an image. These fact vectors are then provided to the decoder model in addition to a hidden representation. Interestingly, the fact vectors can additionally be supervised at training time.

Qualitative Assessment

Strengths:
- Interesting idea of including additional supervision in the encoder-decoder learning approach.
- Experimental evaluation on two tasks and datasets, in both cases showing improvements over ablations.
- The paper shows that the approach is a generalization of standard encoder-decoder approaches.

Weaknesses:
1. For the captioning experiment, the paper compares to related work only on an unofficial test or dev split; the final results should be compared on the official COCO leaderboard on the blind test set: https://competitions.codalab.org/competitions/3221#results. E.g., [5,17] won this challenge and were evaluated on the blind challenge set. Several other approaches have been proposed since then and significantly improved on these results (see the leaderboard); the paper should at least compare to the ones where a corresponding publication is available.
2. A human evaluation for caption generation would be more convincing, as the automatic evaluation metrics can be misleading.
3. It is not clear from Section 4.2 how the supervision is injected for the source code captioning experiment.

While this is overall interesting work, at least points 1 and 3 of the weaknesses have to be addressed for acceptance.

==== post author response ====

The authors promised to include the results from point 1 in the final version. For point 3, it would be good to state the procedure explicitly in Section 4.2. I encourage the authors to include the additional results they provided in the rebuttal, e.g. on T_r, in the final version, as they provide more insight into the approach. My concerns and, as far as I can see, those of the other reviewers have been largely addressed; I thus recommend accepting the paper.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

The paper proposes a novel sequence-to-sequence architecture that includes an encoder and a decoder, as usual, but adds a "reviewer" module in the middle that can perform a fixed number of attentive iterations over the input representations in order to better represent the complete input sequence. This process can also be refined by adding an intermediate loss to guide it, for instance a bag-of-words loss over the target sequence. Experiments are conducted on two different tasks (image captioning and code captioning), and comparisons are made with other state-of-the-art approaches.

Qualitative Assessment

In the last few months, a few related approaches have been proposed that should probably be considered. I remember (at least):
- Order Matters: ..., by Vinyals et al., ICLR 2016
- Adaptive Computation ..., by Graves, ArXiv 2016

The "Order Matters" paper proposes a "Process" module between the encoder and the decoder which, I think, has a lot of similarities with the "Attentive Input Reviewer" variant. The "Adaptive Computation" paper bears similarities with the decoder module, I think. I like the idea of the "discriminative supervision", which enables re-using the same supervision in a different way (a kind of bag-of-words supervision), but I would really like to see how important it is: it seems that in the experiments the \lambda factor that mediates between this loss and the usual loss is fixed, so we don't really know how much it matters (see the sketch below). The experiments on COCO do not mention that more recent results are available on the MSCOCO website, showing better performance on nearly all metrics compared to the ones in the paper. Another experiment I wish had been provided is on the importance of T_r, which is set to 8 in the COCO experiment. What happens when this is changed (lower? higher? is there potential overfitting?)? Regarding the image captioning model using VGG, could you specify where the input attention is taken from (which layer of VGG)? Regarding the attentive input reviewer vs. the attentive output reviewer, I would have liked a discussion on when to use which and how they compare.

========

I have read the authors' response, which answered all my concerns. I think the updated paper should be accepted.
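To make the role of the \lambda factor concrete, here is a minimal sketch of such a combined objective; the function and argument names (fact_logits, lambda_weight, etc.) are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def combined_loss(decoder_logits, target_tokens, fact_logits, target_bow,
                  lambda_weight=1.0):
    """Caption loss plus lambda-weighted discriminative (bag-of-words) loss.

    decoder_logits: (batch, seq_len, vocab)  per-step caption predictions
    target_tokens:  (batch, seq_len)         gold caption token ids
    fact_logits:    (batch, vocab)           word-presence scores from the facts
    target_bow:     (batch, vocab)           multi-hot float indicator of the
                                             words appearing in the caption
    """
    # Standard per-token generation loss.
    caption_loss = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        target_tokens.reshape(-1),
    )
    # Multi-label word-presence loss on the reviewer's fact vectors.
    bow_loss = F.binary_cross_entropy_with_logits(fact_logits, target_bow)
    return caption_loss + lambda_weight * bow_loss
```

With lambda_weight fixed, as noted above, a sweep over its value would be the natural way to reveal how much the discriminative term actually contributes.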

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

This paper proposes to introduce a "reviewer" module into attentive recurrent encoder-decoder networks. The main idea is that the attentive decoder can benefit from access to global summary vectors, referred to as "facts" in the paper, for improved performance in image and source code captioning. After the encoder has processed a sequence, the reviewer module makes T_r further passes over the encoded sequence, each time adding a new vector to the accumulated set of facts F. The reviewer can be trained with an auxiliary task, such as predicting the presence of vocabulary words, or end-to-end for decoding. The encoder-reviewer-decoder (ERD) variants are shown to outperform the baseline attentive encoder-decoder and are competitive with state-of-the-art models for image captioning.
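For concreteness, a minimal reconstruction of such a reviewer loop is sketched below (my own sketch in PyTorch; the class name, the LSTM cell, and the concatenation-based attention are assumptions rather than the paper's exact equations):

```python
import torch
import torch.nn as nn

class AttentiveReviewer(nn.Module):
    """Sketch of the reviewer loop: T_r attentive passes over the encoder
    states, each pass appending one fact vector to the set F."""

    def __init__(self, hidden_dim, num_steps=8):
        super().__init__()
        self.num_steps = num_steps                      # T_r
        self.cell = nn.LSTMCell(hidden_dim, hidden_dim)
        self.attn = nn.Linear(2 * hidden_dim, 1)        # assumed attention form

    def forward(self, enc_states):
        # enc_states: (batch, src_len, hidden_dim)
        batch, src_len, dim = enc_states.shape
        h = enc_states.new_zeros(batch, dim)
        c = enc_states.new_zeros(batch, dim)
        facts = []
        for _ in range(self.num_steps):
            # Attention over encoder states, conditioned on the reviewer state.
            query = h.unsqueeze(1).expand(-1, src_len, -1)
            scores = self.attn(torch.cat([enc_states, query], dim=-1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1)
            context = (weights.unsqueeze(-1) * enc_states).sum(dim=1)
            h, c = self.cell(context, (h, c))
            facts.append(h)  # each pass contributes one fact vector
        return torch.stack(facts, dim=1)  # (batch, T_r, hidden_dim)
```

The decoder would then attend over the returned (batch, T_r, hidden_dim) fact set instead of (or in addition to) the raw encoder states.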

Qualitative Assessment

The proposed modification to the architecture of attentive encoder-decoder networks is sensible and appears to confer a performance boost in image captioning. However, it is not clear how much of the benefit comes from simply adding more capacity to the model, and how much is due to the specific architecture of extracting a set of global "fact" vectors before decoding. Overall, this looks like a slightly better way of doing things, but not a groundbreaking result.

Detailed comments/questions:
- Does discriminative supervision also help the attentive encoder-decoder?
- How does ERD compare to an attentive encoder-decoder with more layers?
- Currently all of the facts are derived from the image alone and potentially grounded in some discriminative task. Is there any way to extract a set of facts using a knowledge base or sources other than the image alone, e.g. prior knowledge about objects appearing in the scene and their common-sense relations?

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

Extending the traditional encoder-decoder framework for end-to-end learning, this paper proposes adding a reviewer module in the middle for visual caption generation. Experimental results show that the reviewer module does improve the performance of image captioning and source code captioning over the traditional attentive encoder-decoder framework. The additional reviewer module, with multiple-step attention over hidden units, captures global information across several attention steps, and also has a discriminative loss to guide the overall learning process.

Qualitative Assessment

The idea of using a multi-step reviewer module is very interesting. Overall, the paper is well written. The attentive mechanism has two variants, the input reviewer and the output reviewer; the authors should explain the difference between them and their usage under different settings in detail in the method section. The multi-step attention is very similar to the multi-hop attention used in memory networks. It is known that the final performance is highly dependent on the number of steps; the authors should explain how increasing the number of review steps affects the results.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)