Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The methods appear to be new, but they are mostly a collection of pretrained components, with minor novelty in linking the visual and linguistic parts. The submission appears technically sound, and the experimental results validate the idea put forth. The analysis is lacking: all of the results point to a single conclusion, the effectiveness of the pretraining method for transfer learning. That point seems well demonstrated, but there is little further discussion. The submission is clear and well organized, making for an easy read. The availability of code aids reproducibility where one might be in doubt about reproducing the work from the paper alone. The results are not particularly surprising or revolutionary; rather, they seem like a reasonable next step for extending BERT. This is not to say the results are not valuable, only to properly scope the importance of this single work, given that much similar work is likely. Others will likely build on this work and refer to it in the multimodal setting.
I think that this paper is a solid extension of masked language model pre-training to image-and-text (e.g., captioning) tasks. It defines two novel but intuitive pre-training tasks for this setting: (i) predicting the semantic class of masked image regions given the surrounding image regions (from the same image) and the corresponding text, and (ii) predicting whether an image-text pair is aligned. The authors demonstrate significant improvements over both the previous SOTA and the strong baseline of simply using a pre-trained text-only BERT model. They also show that having two encoders with separate parameters, one for images and one for text, is superior to a single joint encoder. I would have liked to see more ablation of the pre-training tasks, since I think that this is more interesting than the model-depth ablation that the authors performed. The biggest weakness of the paper is that all of the experiments were pre-trained on Conceptual Captions and then evaluated on other image-captioning (or closely related) tasks. Effectively, this can also be viewed as transfer learning from a large captioning dataset to a small one, which is well known to work. It would have been nice to see, as an additional ablation, what results could be achieved with image and text data alone, without correspondences. The paper does have significant overlap with VideoBERT, but since the work is concurrent I do not think it is fair to penalize this paper because VideoBERT was uploaded to arXiv first, so I did not factor that in.
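To make the two pre-training objectives concrete, here is a minimal numpy sketch of the corresponding losses. This is an illustration of the idea as the review describes it, not the authors' implementation: the function names and tensor shapes are hypothetical, and the soft class targets are assumed to come from an object detector's class distribution over each region.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_region_loss(region_logits, target_class_probs, mask):
    """Task (i): cross-entropy between the model's predicted class
    distribution for each *masked* region and the detector's soft class
    distribution for that region.

    region_logits:      (num_regions, num_classes) model outputs
    target_class_probs: (num_regions, num_classes) detector distributions
    mask:               (num_regions,) boolean, True = region was masked
    (all shapes are illustrative assumptions)
    """
    log_pred = np.log(softmax(region_logits[mask]))
    return -(target_class_probs[mask] * log_pred).sum(axis=-1).mean()

def alignment_loss(alignment_logits, is_aligned):
    """Task (ii): binary cross-entropy on whether an image-text pair
    is aligned (1) or mismatched (0)."""
    p = 1.0 / (1.0 + np.exp(-alignment_logits))
    return -(is_aligned * np.log(p)
             + (1 - is_aligned) * np.log(1 - p)).mean()
```

In practice the two losses would be summed and minimized jointly during pre-training, with mismatched pairs drawn by pairing an image with a randomly sampled caption.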
Strengths: - Reusable, task-agnostic visual-linguistic representations are a very interesting approach to tackling the visual grounding problem. - The authors adapted the widely known BERT model to a new multimodal task.