Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
1. Recent parallel work by Haurilet et al., "It’s not about the Journey; It’s about the Destination: Following Soft Paths under Question-Guidance for Visual Reasoning", CVPR 2019, is closely related, as it also builds reasoning on scene graphs. However, it was published after the NIPS deadline and deals mostly with synthetic images.
2. The supp. material provides ablation studies. In particular, it is good to see a 4 percentage point difference between using region features vs. region concept labels -- highlighting that the symbolic approach is indeed useful. A few additional ablation experiments would help: (a) Analyze the impact of errors in scene graph prediction. This seems like an important backbone, especially as concept labels are being used rather than features. Is there any chance of recovery if the visual prediction makes mistakes? (b) Bias in image question answering is well known: answering can work well even without looking at the image. While VQA-CP does limit this to some extent, the proposed method uses a concatenation of the question and the "answer vector" m. What would the performance be without this concatenation? With multiple steps of reasoning, one could hope that this is not as necessary as in other models.
3. Overall the paper is well written and clear. Code will be released. It might be nice to include one of the qualitative results from the supp. material in the main paper, as it highlights how the proposed approach works.
--------------------
Post-rebuttal: I'm happy to recommend accepting this paper. The rebuttal clarifies some of the additional questions raised by all reviewers.
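To make the ablation requested in point 2(a) concrete, here is a minimal sketch of how scene graph errors could be simulated. It is not taken from the paper: the function names `perturb_labels` and `drop_objects`, the uniform-noise mixing, and the random object dropping are all assumptions chosen for illustration.

```python
import random


def perturb_labels(probs, eps):
    """Mix a categorical distribution over concept labels with uniform noise.

    probs -- list of probabilities summing to 1 (hypothetical parser output)
    eps   -- noise level in [0, 1]; 0 keeps probs, 1 yields a uniform distribution
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    k = len(probs)
    # Convex combination of the original distribution and the uniform one,
    # so the result still sums to 1.
    return [(1.0 - eps) * p + eps / k for p in probs]


def drop_objects(objects, drop_rate, seed=0):
    """Remove each detected object independently with probability drop_rate."""
    rng = random.Random(seed)  # seeded for reproducible ablation runs
    return [obj for obj in objects if rng.random() >= drop_rate]
```

Sweeping `eps` and `drop_rate` and plotting answer accuracy against them would show how quickly the model degrades as the predicted scene graph gets noisier.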
I think this is a strong and interesting submission. The presented model, named "Neural State Machine", deviates from existing approaches to visual question answering by performing 'a sequence of computations' that resembles 'a sequence of reasoning steps'. There are, however, clear resemblances to existing approaches to visual question answering. E.g., there are approaches that use classifier outputs as a knowledge representation, semantic parsers as computational and compositional methods to derive an answer, and classifier uncertainty to represent concepts (e.g., "A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input"). There are also similarities to graph neural networks in terms of compositionality and computation (message passing). Still, the method seems to diverge from these significantly enough to be considered novel. Moreover, the results on GQA are quite strong. The paper "Language-Conditioned Graph Networks for Relational Reasoning" comes to mind as the graph-network equivalent of the Neural State Machine, but it performs significantly worse than the latter. However, this could be a result of a different vision representation. It is worth mentioning that GQA is semi-synthetic (the questions are synthetic), and hence there is a possibility to 'game' the dataset. Therefore it is nice that the authors also provide strong results on the VQA-CP dataset, proving their point. Overall, I think this is an interesting submission, with a reasonably novel model and strong results.
*Originality* The presented approach is relatively easy to understand and doesn’t require extra training data. As far as I can tell, the model is relatively simple and mostly operates over, and recomputes, probability distributions over discrete elements in the image and tokens in the sentence. It’s not a surprising next step in this area, but this approach is a good step in that direction. One concern is the assumptions placed on the image content space by using a dataset like Visual Genome/GQA. Visual Genome uses a fixed ontology of properties and possible property values and (as the paper states in L129) ignores fine-grained statistics of the image (e.g., information about the background, like what color the sky is). Requiring this fixed ontology may work for a dataset like GQA, which is generated from such an ontology, but may be harder to extend to other, more realistic datasets where topics don’t have to be limited to objects included in the gold scene graph. (Of course, the VQA-CP results are SOTA as well. Thus, I would have liked to see more analysis from VQA-CP, where as far as I understand gold scene graphs are not available.) A highly related paper is Gupta and Lewis 2018 (which evaluates on CLEVR by creating a differentiable knowledge graph).
*Quality* As stated above, my main concern is that this method relies on a scene graph parser that uses a fixed ontology of object types, properties, and relations. It’s also not obvious to me how the state machine could capture wide scopes as instantiated by universal quantifiers, negation, or counting, which are not heavily represented in GQA (Suhr et al. 2019). Evaluating on additional challenging visual reasoning datasets with real language (and new images never before evaluated by scene graph parsers) could measure this model’s ability to handle a wider and noisier variety of language reasoning skills. Some such datasets: CLEVR-Humans (Johnson et al. 2017) and NLVR (Suhr et al. 2017) would both provide gold standard scene graphs but would evaluate other linguistic reasoning problems; NLVR2 (Suhr et al. 2019) would test both (more than VQA; very recent paper). It would have been nice to see some evaluation of the amount of noise allowable in the state machine (coming from the scene graph), i.e., by perturbing the distributions of properties, or even adding and removing objects. Another way of putting it: how many errors are caused because the scene graph prediction is incorrect, and how does this change when evaluating on noisier images (i.e., ones which have *never* been seen by a scene graph parser; as far as I understand it all images in the GQA test set have been seen at least a few times during test-set evaluation of existing scene graph parsers, so existing scene graph parsers should do modestly well on them)? ---> After reading the rebuttal: sorry, I wasn't aware that the test set was a completely new set of images for GQA. I appreciate the inclusion of experiments wrt. scene graph noise -- never mind the concern about the test set! What’s the time complexity of applying an instruction to the current distribution over nodes? Since the state machine is a fully-connected graph (as I understand), it seems like this will be a very expensive operation as the scene graph gets larger. What’s the size of the splits for the GQA generalization tests (Table 4)?
*Clarity* Details on how the model is trained should be included in the main paper. The approach section left me with many questions about annotation and what parameters were being learned (this should be listed somewhere). It would naturally be followed by a training section. Similarly, the experimental setup (e.g., which scene graph parser is used) should be included. The output space of the decoder in Section 3.3 should be defined -- is this picking items from the input hidden states (if so, how is this supervised?)?
Or is it just applying a recurrent transformation on them somehow and generating a sequence the same length as the input?
Some terminology was confusing as the terms were being overloaded:
- “alphabet” -- usually refers to single graphemes, not words.
- “learning” in Section 3 (i.e., constructing the state machine) is not really learning anything, just performing a bit of inference to get the state machine.
- “raw dense features” should be specified as the direct image features (L125).
- “tagging” in L166 to me is a misuse of the term because if anything it’s a soft tag. “Alignment” would be a better term.
- “normalizing” in Section 3.3 is also misused; “contextualized” would make more sense.
Some other notational things:
- The numbers on the edges in Figure 1 don’t seem to add anything.
- The types of the elements of the sets in L108--112 should be defined. Are these all vectors?
- L in L119 is unexpected without having defined the set of L properties first.
- L289: “4th” should be “5th”.
Typo: “modaility” in L129.
*Significance* This approach is easy to understand and doesn’t require extra data beyond scene graph annotation during training. It outperforms SOTA for two visual reasoning tasks.
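
The time-complexity question raised under *Quality* can be illustrated with a minimal sketch. It assumes a generic update in which probability mass is shifted along weighted edges of a dense graph and then renormalized; this is an illustrative stand-in, not the paper's exact update rule, and `propagate` and `edge_scores` are hypothetical names.

```python
def propagate(p, edge_scores):
    """One reasoning step over a dense graph, with an operation count.

    p           -- list of N probabilities over nodes (current state distribution)
    edge_scores -- N x N list of non-negative scores; edge_scores[i][j] is the
                   assumed relevance of edge i -> j to the current instruction
    Returns (new distribution over nodes, number of multiply-adds performed).
    """
    n = len(p)
    mass = [0.0] * n
    ops = 0
    for i in range(n):          # every source node ...
        for j in range(n):      # ... sends mass to every target node
            mass[j] += p[i] * edge_scores[i][j]
            ops += 1
    total = sum(mass)
    if total == 0.0:
        # No edge is relevant: keep the distribution unchanged.
        return list(p), ops
    return [m / total for m in mass], ops
```

Under these assumptions a single step costs N^2 multiply-adds for N nodes (times the feature dimension if the edge scores are themselves computed from vectors), so the cost grows quadratically with scene graph size. Iterating over an explicit edge list instead of all node pairs would reduce this to the number of edges.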