This paper had borderline scores. Overall, I think this paper presents a nice core finding that was sufficiently well validated in the context of simulations. The simulated results are reasonably compelling, and the analyses presented are relatively thorough. In the discussion with reviewers, there was reasonable consensus that the author response overemphasized the extent to which the reviewers were hung up on the lack of experimental data. While R3 felt most strongly that this specific paper was not strong enough without further empirical validation, the reviewer clarified that this was not a bias against simulation papers generally, but rather an opinion that this specific paper's results were not convincing enough without real data. Other reviewers indicated that their concern was actually, at least in part, that the authors simply overstated their claims given the lack of empirical validation (more below). R1 and R4 also felt that the authors might have found a way to leverage pre-recorded neural data (collected under conventional settings) to partially validate the claims of the paper. The author response did not attempt to address this point, which the reviewers found dissatisfying. While at this point I do not believe I can enforce further changes to the paper, there are a few summary points that warrant attention.

Remaining concerns

1) Lack of nuance in motivation: It is unambiguously clear that reducing the amount of experimental time required to directly fit models of neural responses would be useful. This was part of the motivation, and it is essentially correct. Nevertheless, this work does not fully contextualize itself relative to prior attempts at direct fitting and instead simply asserts that direct fitting is not presently tractable.
There are, as R1 noted, multiple previous papers that successfully train direct models of neural activity from relatively small quantities of data, as well as other work that has taken creative approaches around this problem. For example, multiple papers by Fetz and colleagues in the early 90s did versions of direct training, using RNNs to predict neurons in the motor system, admittedly for small input spaces. More recently, there has been work directly fitting retinal neurons with relatively small amounts of data (e.g. McIntosh et al. 2016, Batty et al. 2017). And for visual cortical neurons, papers from the Zylberberg lab are particularly relevant. Specifically, Kindel et al. 2019 performs direct fitting from publicly available data. Confusingly, the authors cite this paper, along with a large group of references, in support of the statement "To date, the most successful applications of this approach use visual features extracted from natural images by pre-trained models". However, as far as I can tell, Kindel et al. in fact comes to the opposite conclusion from what was stated in the present submission: a headline result of that work was that a fully data-driven model of V1 outperformed pretrained models (Fig. 4). I found the misrepresentation of this work particularly alarming since it is a work about V1, and the title, abstract, and framing of the present submission focus on visual cortex. While the authors were correct to point out in their response that direct training of neural networks specifically on V4/IT neurons has probably not been done, those regions were not framed as the exclusive focus of the present submission. And if those brain regions are the main focus, other work, e.g. by Pasupathy and Connor or Cadieu et al. (mentioned by R4), would be relevant to discuss. Overall, I believe the authors could have situated their work better...that is, without errors and with more nuance.
In my reading of the reviews, this was core to R1's concerns, and I share them.

"Improved object recognition using neural networks trained to mimic the brain's statistical properties", Federer et al. 2020 (published after the NeurIPS submission deadline, but possibly of interest)
"Multilayer recurrent network models of primate retinal ganglion cell responses", Batty et al. 2017
"Using deep learning to probe the neural code for images in primary visual cortex", Kindel et al. 2019
"Deep learning models of the retinal response to natural scenes", McIntosh et al. 2016

2) The use of the word "gaudy" to describe the images, while perhaps meant to be catchy, seems to me very misleading: the word already has a common usage that does not align well with the proposed technical definition. I asked the reviewers how they felt about this point. Two reviewers agreed the word choice is confusing (the other two did not address the point). The work seemed to me to simply be using a kind of "saturated" image. Reviewer suggestions for alternatives include sticking with "high-contrast" or using "color quantized" or "binarized natural images".

3) The title and some of the claims seem to overreach, as discussed above. In the reviewer discussion and responses, there was clearly some irritation that the title and some statements in the paper implied stronger validation than the paper actually provides (as R4 noted, with a suggestion to rephrase to "simulated visual cortex"). Again, the authors responded that it was too much to ask them to perform electrophysiology experiments before publication at NeurIPS. While three of the four reviewers and this AC were willing to accept that argument in part, the primary point was that the paper should more accurately reflect its limitations and caveats, especially the fact that no real-world experiments were performed.
I would strongly urge the authors to address these summary points, as well as the specific comments by the individual reviewers, in their revisions. Nevertheless, the quality of the core contributions, along with the generally clear presentation and strong simulated validations, makes me comfortable recommending this paper for acceptance.