Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
Overall this is a really interesting idea: incorporating concrete visual concepts and more abstract metaconcepts in a joint space, and using the learning of one to guide the other. There are some issues below, mostly details about the training implementation, that could clear up my questions.

1. Why not use pretrained word embeddings for the GRU model? The issue here is that the object proposal generator was trained on ImageNet, meaning it almost certainly had access to visual information about the held-out concepts in Ctest. The GRU baseline, even with significantly less training data, outperforms on instance-of. Would the same be true of the synonym metaconcept if it had been trained on data like Common Crawl or Wikipedia? The authors should either pretrain the vision model with access only to the training concepts or train word embeddings on more data for a fairer comparison.
2. In Section 4.3.2, how do the authors know the instance-of metaconcept is the reason the model outperforms the others on CLEVR? Also, all the models have essentially the same performance on GQA.
3. In Table 4, high performance on synonym makes sense because the concepts are visually similar and the visual features are computed from a model pretrained on ImageNet. For the instance-of metaconcept, which likely relies more on linguistic information than synonym does, the GRU with word embeddings trained on a small amount of data outperforms the pretrained semantic parser. There is a huge performance gap between the metaconcepts, the largest being for this model, that should be explored more in depth. This also ties into the comment above.
4. Did the authors consider using existing zero-shot compositional visual reasoning datasets (e.g., C-VQA) instead of manually generating questions and answers?

Clarity:
- Well-written paper with clear, easy-to-follow sections. A small change to Fig. 3 could be to add more space between the lines coming from the meta-verify parse.
- Was the GRU portion of GRU-CNN also pretrained only on the questions?
- The number of metaconcepts used isn't clear until the final section before the conclusion (except for a brief mention in Table 1). Perhaps add an additional line in the dataset section.
- How was the semantic parser trained? If it wasn't trained on just the training questions, in the same way the GRU was, then this isn't a fair comparison with the GRU and GRU-CNN baselines.
- Table 4 only includes results for GQA, but the text also mentions results for CLEVR.

Notes:
- line 110: "In Figure 2, We illustrate.." -> "In Figure 2, we illustrate"
- line 119: "we first runs.." -> "we first run"
- line 181: "we assume the access to" -> "we assume access to"
- line 247: "it fails the outperform" -> "it fails to outperform"
- line 260: "to indicate the models" -> "to indicate to the models"

References:
C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset. Agrawal et al., 2017.
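The pretrained-embedding suggestion in point 1 could look like the following sketch in PyTorch. The vocabulary and vectors here are random placeholders; in a real comparison they would be loaded from GloVe or word2vec vectors trained on a large corpus such as Common Crawl or Wikipedia.

```python
import torch
import torch.nn as nn

# Placeholder "pretrained" vectors; in practice these would be loaded from
# a large-corpus embedding file rather than generated randomly.
vocab = {"<pad>": 0, "red": 1, "cube": 2, "synonym": 3}
pretrained = torch.randn(len(vocab), 50)

class GRUEncoder(nn.Module):
    def __init__(self, pretrained_vectors, hidden_size=64, freeze=True):
        super().__init__()
        # Initialize the embedding layer from pretrained vectors instead of
        # learning embeddings from the (small) question set alone.
        self.embedding = nn.Embedding.from_pretrained(
            pretrained_vectors, freeze=freeze
        )
        self.gru = nn.GRU(
            pretrained_vectors.size(1), hidden_size, batch_first=True
        )

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, hidden = self.gru(embedded)
        return hidden.squeeze(0)  # (batch, hidden_size)

encoder = GRUEncoder(pretrained)
ids = torch.tensor([[1, 2, 3]])  # token ids for "red cube synonym"
out = encoder(ids)
print(out.shape)
```

Freezing the embeddings (`freeze=True`) keeps the linguistic prior fixed, which is the more controlled variant of the comparison; unfreezing them lets the question data fine-tune the vectors.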
Strengths:
- The proposed model is relevant and close to how much human learning is done, i.e., through a hierarchy of concepts, with inference often carried out through the interaction of those concepts.
- The proposed model learns visual concepts and metaconcepts, which specifically aid zero-shot learning and generalization to noisy, unseen, and even biased dataset conditions.
originality: It is difficult to evaluate the originality because there is no discussion of related work on metaconcepts. Perhaps none exists (I'm not familiar with the area), and if so the authors should be clearer about that.

quality: I find the discussion of the results in Table 3 puzzling -- it looks to me like the metaconcept "instance of" is (barely) helping in the synthetic dataset only, and is not providing any benefit in the real dataset. Moreover, these accuracies are awfully close to 50% -- is this not chance accuracy? It would be helpful to provide significance testing for the differences.

clarity: Generally, the paper is well written, though it assumes a fair amount of background knowledge on the reader's part. Also, L216 mentions that the proposed method will be compared to BERT, but I cannot find this comparison.

significance: Because only two metaconcepts are tested, and only one of them actually somewhat benefits concept learning in a real dataset, it is difficult to say how significant this method is.

*** POST-REBUTTAL ***
I had two major concerns about this work. The first was that the considered metaconcepts were quite contrived, and it was not easy to see what other metaconcepts could be incorporated. In their rebuttal, the authors include results with a new metaconcept, "hypernym", but I fail to see how this differs from their original metaconcept "is an instance of". The second major concern was that the experimental results did not seem thorough -- I can't find any results reported over multiple random seeds, despite the authors claiming in the reproducibility checklist that they have provided clear error bars and standard deviations. This concern also holds for the new results reported in Tables A, B, and C. Specifically, the results in Table C would have been more convincing if they had been computed in a bootstrapped fashion (and if they actually were computed this way, the authors should have reported a standard deviation).
These concerns still stand after the rebuttal. Additionally, a more minor concern is that I feel this work is not a good fit for the Neuroscience and Cognitive Science track to which it was submitted. I find it only very loosely related to human psychology, and it is difficult to see how a researcher in this area (like myself) would be interested in this work. For these reasons, I maintain my review score.
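The bootstrapping and significance testing the review asks for could be sketched as follows. The per-question correctness data here is synthetic, standing in for the model's actual per-question outputs; the resampling procedure itself is standard.

```python
import random

def bootstrap_accuracy(correct, n_resamples=2000, seed=0):
    """Bootstrap the mean accuracy and its standard deviation from a
    list of per-question 0/1 correctness indicators."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample n questions with replacement and record the accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    mean = sum(means) / n_resamples
    var = sum((m - mean) ** 2 for m in means) / n_resamples
    return mean, var ** 0.5

# Synthetic stand-in: 1000 binary metaconcept questions, ~53% correct.
rng = random.Random(1)
correct = [1 if rng.random() < 0.53 else 0 for _ in range(1000)]
acc, std = bootstrap_accuracy(correct)
print(f"accuracy = {acc:.3f} +/- {std:.3f}")
```

Since chance accuracy on these binary questions is 50%, an observed accuracy within roughly two bootstrap standard deviations of 0.5 is not clearly above chance, which is exactly the ambiguity the review raises about Table 3.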