I have read the reviews and the author response, and I have also asked an expert AC to provide a comment in lieu of a 4th reviewer (pasted below for reference). Taking all of these together, I recommend acceptance, with a note.

NOTE TO AUTHORS: This work is going to be the reference paper for using generation as opposed to discrimination. As such, it is crucial to set the right path for evaluating models in a fair and rigorous way, so that research that follows builds on a solid base. The presented evaluation has some issues (see points below). Please use the feedback provided and incorporate the human evaluation into the paper.

======

All reviewers agree that this is an intriguing way of viewing progressive matrices tests: as a generation problem rather than a discrimination problem. The authors themselves point out as motivation that "the ability to generate a correct answer is the ultimate test of understanding the question". I personally agree that this is an extremely interesting hypothesis, but the paper as it stands only goes halfway toward answering this question convincingly. Currently, the main evaluation of the paper focuses on the generation quality of a particular model, rather than on whether the generation process intrinsically provides a better training signal than the discriminative one. At the same time, echoing R1, evaluating generation quality is currently problematic, since it is done with learned models that are themselves far from perfect: even a perfect generation model would score anywhere from 75 to 85, depending on the underlying classification model. R2 pointed out the need for human evaluation, and the authors do provide some human results in the author response. My recommendation would be to state clearly that automatically evaluating generation is hard, present your results on that (you did the best you could), and also present the human evaluation.

Perhaps the more interesting result is in Table 3. It is positive to see that the auxiliary classifier trained within this generation pipeline improves upon other fully discriminative models. Shouldn't this also be a somewhat central part of the evaluation for the models that will follow?

On a final note, the authors claim in the discussion that "Our work presents the first method to perform this task convincingly in the context of RPMs", but the method is not really compared to baselines, so I don't think I agree with this statement. What would need to happen for the method not to perform convincingly?

======EXTRA REVIEW======

I think creating a neural network that can generate human-plausible answers to Raven's Progressive Matrices is a notable step forward in the list of things that neural networks (and ML in general) can do, given that the network is operating at the interface between symbolic/mathematical 'reasoning' and spatial/visual intelligence. I would have given the paper a score of 8 and, after the author response, probably a 9, and nominated it for an honorable mention.

The authors only consider one problem domain. A different domain would be the icing on the cake, but I don't think it's critical, because RPMs are very representative of visual-logical IQ tests in general.

A main concern is the accuracy of the evaluation metric. The authors (during the rebuttal period) ran a human evaluation which shows that humans generally choose the model's answer as the correct answer. This study should be included in the camera-ready.
Another solution would be to experiment with the 'interpolation' subset of the RPM dataset, where classifiers reach 95% rather than 77% accuracy (because those problems involve less fine-grained distinctions in shades of grey). The paper could also be written more clearly: it is very dense, and I would have preferred a less symbolic/mathematical description of how the networks function and a more spatial/functional one. Overall, this is one of the most exciting papers I have seen in a long time. I strongly recommend acceptance.
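For concreteness, below is a minimal sketch (not the authors' code) of the kind of classifier-based evaluation that both the main comment and the extra review are concerned about: the generated panel is substituted for the ground-truth candidate, and a pretrained discriminative scorer decides whether that slot is still selected. All names (generator, scorer, loader) and tensor shapes are illustrative assumptions. The point is that the resulting number is upper-bounded by the scorer's own accuracy (roughly 75-85 on these splits, 95 on the interpolation subset), so it conflates generator error with evaluator error.

```python
# Hypothetical sketch of classifier-based evaluation of generated RPM answers.
# generator, scorer, and loader are illustrative placeholders, not the paper's API.
import torch

@torch.no_grad()
def classifier_based_accuracy(generator, scorer, loader):
    """Fraction of puzzles where a pretrained discriminative scorer still picks
    the answer slot after its ground-truth panel is replaced by the generated one."""
    correct, total = 0, 0
    for context, candidates, target_idx in loader:   # candidates: (B, 8, H, W)
        generated = generator(context)                # (B, 1, H, W) proposed answer panel
        candidates = candidates.clone()
        # Substitute the generated panel for the ground-truth candidate.
        candidates[torch.arange(len(candidates)), target_idx] = generated.squeeze(1)
        scores = scorer(context, candidates)          # (B, 8) compatibility scores
        correct += (scores.argmax(dim=1) == target_idx).sum().item()
        total += len(candidates)
    return correct / total
```

Even a generator that reproduced the ground-truth panel exactly would only be credited when the scorer itself answers correctly, which is why the human evaluation is an important complement to this metric.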