NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5059
Title: Learning from brains how to regularize machines

Reviewer 1


CNNs, like visual cortex, build a representation of the visual world that is useful to the “viewer”. We have known for a while that CNNs trained on object recognition tasks capture some (but not all) aspects of the representation computed by primate visual cortex. Here the authors propose to bridge the gap by explicitly encouraging a CNN to build a representation that is “similar” to the one computed by the visual cortex of mice. This is a neat idea and certainly a novel one. The paper is clearly written, which I appreciated. The research question being tackled is clearly explained, and the experiments are properly designed (I really like the randomized-matrix control). The research being presented is properly placed in the context of the existing literature, and all the moving parts are well summarized, which made the paper very easy to follow. I wish the authors would consider citing at least one piece of work by Poggio and collaborators. Poggio spent the first half of this decade characterizing the robustness of neural and artificial representations, so I think that line of inquiry is relevant. Perhaps either Tacchetti, Isik, and Poggio (Annual Review of Vision Science, 2018), which is a good review of that work, or the Poggio and Anselmi 2016 book with MIT Press.

I genuinely liked this paper, and I think it presents a very interesting idea that I am sure will inspire further inquiry. There are a few things I wish the authors had included; hopefully these can serve as suggestions for a revised camera-ready version and might improve the significance of this work.

1) While this paper certainly presents a “cool” idea, it is unclear what we, as either neuroscientists trying to understand the brain or computer scientists trying to replicate human visual intelligence, are to make of this result. Is there a way to dive deeper and understand which nuances the unregularized CNN representation was missing? What is the “computational goal” encoded in the mouse similarity matrix that was not present in the baseline CNN representation? How does this result help us build better CNNs, and what did we learn about the mouse visual system that we did not know before?

2) One possible way to start getting at this would be to investigate what happens if one trains the CNN exclusively with the regularization term (no task). Is that representation worse? Where does it fail?

3) In a similar spirit, random noise and adversarial perturbations are solid choices for a sanity check, but they do not really reveal where the two representations differ and why. I wish the authors had designed more semantically relevant “attacks”. Maybe it’s 3-D rotations, maybe it’s illumination; what is it that CNNs are missing?

4) Finally, a minor suggestion: it is my understanding that mice are pretty much blind, and either way do not “use” their sense of vision all that much. Why did you choose mice? Could you put a sentence or two in the text to explain your choice?

Thank you for sharing these cool ideas and results! I hope these suggestions help. All the best!

Reviewer 2


This paper introduces a neural regularization method for CNN-based image classification architectures. The authors hypothesize that biasing the representations of artificial networks towards biological stimulus representations might positively affect their robustness. They presented natural images (CIFAR10) to mice and measured the responses of thousands of neurons in cortical visual areas. A predictive model is trained on the collected data (100 oracle images) in order to estimate the neuronal responses to a much larger image set (5000 images) and to denoise them. The neural representational similarity is then used to regularize the CNN by penalizing intermediate representations that deviate from the neural ones (a minimal sketch of this scheme is given after this review). Experimental results on the CIFAR10 data show that the proposed regularization method achieves better performance on classifying noisy images, compared with several baseline and control models.

(1) The idea of using the similarity structure of neurophysiological data to regularize the representations of artificial neural networks is interesting and novel. The authors have made good efforts towards bridging the fields of neuroscience and machine learning through such regularization.

(2) Experimental results on the CIFAR10 data demonstrate the effectiveness of the proposed method and provide some validation of the hypothesis.

(3) In the proposed joint training method (Section 3), a selection of layers from the bottom to the top of the architecture is chosen. The final similarity of the CNN representations is a weighted average of the similarity calculated at each layer. The outputs of different layers can be dramatically different, which leads to quite different final similarity values. The authors need to provide guidelines on how to choose such a selection of layers.

(4) In Section 2.2, a predictive model is used to "denoise neural responses". What is the prediction accuracy (or correlation) in the proposed experimental setup (i.e., using the neural responses to the 100 oracle images to predict the responses to the 5000 non-oracle images)? The scaled model response is defined as $\hat{r}_{ai}=w_{a}v_{a}\hat{\rho}_{ai}$. It is not clear which correlation measure is used to compute $v$. Generally speaking, a correlation coefficient measures the statistical relationship between two variables. The definition of the scaled model response is not mathematically correct or meaningful. I am a bit concerned about how the performance of the predictive model can affect the final results. The prediction accuracy of a model built on 100 images is likely not high when applied to 5000 images.

(5) The proposed method is only applied to one dataset (CIFAR10) and one architecture (ResNet18). The authors may want to report the performance on other datasets and architectures as well.

(6) The authors claim that "we denoised the notoriously variable neural activity using strong predictive models trained on this large corpus of responses from the mouse visual system, and calculated the representational similarity for millions of pairs of images from the model’s predictions". From the current experimental setup, the prediction model is only applied to the 5000 non-oracle images, not at the million scale.

(7) How can the proposed neural regularization method be used in practice? It is apparently not a trivial effort to collect actual neural responses, build a prediction model, and then perform joint training. It would be great if the authors could provide some discussion along these lines.
(8) Minor issue: there are also some typos and grammar issues in the paper.

Post-Rebuttal: I appreciate the authors' response, which helped clarify some important technical details, especially on the use of the 100 oracle images, the training of predictive models on the 5000 non-oracle images, and the scale of the training data. The authors also added new experiments on other datasets and architectures. I hope these clarifications and discussions can be included in the final version of the paper.
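To make the regularization scheme summarized above concrete, here is a minimal sketch written against PyTorch-style tensors; the cosine similarity measure, the squared-error penalty, the softmax-normalized layer weights, and all names are illustrative assumptions rather than the authors' implementation.

    # Illustrative sketch of a neural-similarity regularizer (not the authors' code).
    # Assumes the chosen per-layer CNN features are already flattened to (batch, dim).
    import torch
    import torch.nn.functional as F

    def neural_similarity_loss(feats_a, feats_b, neural_sim, layer_logits):
        """feats_a / feats_b: lists of per-layer feature tensors for the two images
        in each pair; neural_sim: (batch,) similarity measured (via the predictive
        model) in mouse visual cortex for the same pairs; layer_logits: trainable
        per-layer mixing coefficients."""
        weights = torch.softmax(layer_logits, dim=0)  # trainable layer weighting
        combined_sim = 0.0
        for w, fa, fb in zip(weights, feats_a, feats_b):
            combined_sim = combined_sim + w * F.cosine_similarity(fa, fb, dim=1)
        # Penalize deviation of the CNN's pairwise similarity from the neural one.
        return ((combined_sim - neural_sim) ** 2).mean()

Joint training would then minimize the classification loss plus this term scaled by some coefficient.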

Reviewer 3


Using real biological data to help neural network training is a very meaningful topic. This paper proposes an interesting idea: regularizing a NN so that it preserves the feature distances measured in the mouse brain. If it is the first paper to do this, I really think it would be an interesting paper to appear at NeurIPS. However, the paper does have many drawbacks. First, this paper does not have a related work section and does not provide enough of a survey of prior work in this domain. This leads to another issue with the paper: the contribution is not stated clearly. For example, is this paper the first to use mouse brain responses to regularize a NN? On the technical side, some design choices are not well motivated.

1. Why use grayscale images instead of RGB images? Is it because mice do not perceive color?

2. Why are 'oracle images' needed? This part is very confusing. Why do some images have a quality predictor? How do you pick those 'oracle images'? Please explain.

3. The design of the similarity of convolutional features is also tricky. Why is the linear coefficient (equation 10) that combines the similarities of each layer trainable? What if one just takes the average of the layer similarities as the convolutional feature similarity (a sketch of this alternative follows the review)?

Minor comments: Although it is an interesting method, it does not seem very scalable, since one has to show images to animals to obtain the regularization data. Also, how does the amount of regularization data affect the regularization result? If more data were used, would the result be better?
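As a concrete version of the alternative raised in point 3 above, the sketch below contrasts a fixed uniform average of per-layer similarities with trainable coefficients in the spirit of equation 10; the softmax normalization and the tensor shapes are assumptions made for illustration only.

    import torch

    def combine_layer_similarities(layer_sims, layer_logits=None):
        """layer_sims: (n_layers, n_pairs) tensor with one similarity value per
        layer and image pair."""
        if layer_logits is None:
            # The fixed control suggested above: equal weight on every layer.
            n_layers = layer_sims.shape[0]
            weights = torch.full((n_layers,), 1.0 / n_layers)
        else:
            # Trainable coefficients, normalized to a convex combination
            # (whether the paper normalizes this way is an assumption).
            weights = torch.softmax(layer_logits, dim=0)
        return (weights[:, None] * layer_sims).sum(dim=0)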