__ Summary and Contributions__: In this work the authors trained RNNs to solve a task that requires inferring and representing two latent variables that evolve on two separate timescales. They show that the behavior of the RNNs is quantitatively similar to optimal Bayesian baselines. They then reverse-engineer the trained RNN using various techniques to provide a detailed dynamical-systems description of how the two latent variables are represented in the network's hidden state, how they influence the network's decision, and how the learnt connectivity supports these dynamics. Finally, they employ a distillation technique to recover some of the insights gained from the preceding reverse-engineering work.

__ Strengths__: - This high-quality work provides a dynamical-systems perspective on Bayes-optimal computations in RNNs. It does so for a task involving inference of a latent variable over a long timescale, which advances our understanding of context-dependent computation in RNNs.
- The task is of interest for neuroscience as part of the IBL project.

__ Weaknesses__: - It is not clear to what extent this is relevant to the non-neuroscientific crowd at NeurIPS as the part about the distillation technique is a small portion of the work and not completely novel as stated by the authors. I thus wonder whether this work would be better suited for publication in a neuroscience journal.

__ Correctness__: Yes

__ Clarity__: Yes, the paper is well written.

__ Relation to Prior Work__: It would be nice to discuss the relationship with the following work, which reverse-engineers RNNs performing Bayes-optimal integration of stimuli:
Bayesian computation through cortical latent dynamics
H Sohn, D Narain, N Meirhaeghe, M Jazayeri
Neuron 103 (5), 934-947.e5

__ Reproducibility__: No

__ Additional Feedback__: Update: I thank the authors for their update, which will surely improve the quality of the manuscript, and for making their code available. I thus keep my score at 7, as this paper will make for a nice contribution.
The only major concern I have regards the appropriateness of the work for the NeurIPS audience. Other than that I found the work interesting and of a very high quality.
One thing I found missing in the paper is the details of the training procedure. To infer the value of the latent variable corresponding to block identity, the RNN needs multiple trials. In typical RNN training for neuroscience tasks, gradients are back-propagated over only one trial, and the network's state is drawn randomly, or set to 0, at the beginning of each trial. Are gradients back-propagated over multiple trials? If so, what about the vanishing gradient problem? How are network states initialized at the beginning of each trial?
Another question about training concerns the reward input (it is mentioned not to be important after training, but I guess it is instrumental during training). This reward input depends on the trial-by-trial value of the readout unit and as such cannot be defined a priori (typical training procedures define a set of input-output pairs a priori, or use RL-like approaches). How is this problem overcome here?
Another improvement would be to discuss and motivate the distillation approach a bit more, e.g. which new insights does it give compared to the previous analysis, and which insights does it fail to give?

__ Summary and Contributions__: The authors train an RNN on a task inspired by an experimental neuroscience task. The task requires integrating information across two timescales.
The authors analyze the RNN using a variety of methods. Performance is compared to ideal Bayesian solvers. The internal representation of the RNN is analyzed in a meaningful fashion. RNN dynamics show a line attractor that explains the observed representation. Finally, a novel pruning method reduces the RNN to a 2-unit network that retains its main properties.

__ Strengths__: 1. A complete analysis trajectory of an RNN on a neuroscience-relevant task, from training through behavior to novel analysis techniques.
2. Introduction of the reduction method.

__ Weaknesses__: 1. There are no statistics. It seems that the entire paper presents a single instance of a trained RNN.
2. The clarity of the paper – figures and explanations of methods – is lacking.

__ Correctness__: The results appear correct.

__ Clarity__: The rationale and results are clearly presented.
Figures have a very small font size, and the colors are sometimes highly similar (and not colorblind friendly).
Some of the details are unclear: is the objective to increase reward? Is training done in batches? Cross entropy with respect to which target?

__ Relation to Prior Work__: There is no dedicated section for related work, and the discussion section does not address that either.
The introduction mentions related work, but without clearly comparing the present work to them.
A few related works that seem relevant:
1. Full-FORCE has internal activity as training targets. This is related to the model compression.
2. Tasks involving two interacting variables (Line 187). Mante & Sussillo have context interacting with stimulus. Sussillo & Barak (2013) has a plane attractor. Romani & Tsodyks (PLoS Comput. Biol. 2010) study two coupled continuous attractors.

__ Reproducibility__: Yes

__ Additional Feedback__: POST REBUTTAL EDIT:
I read all reviews and the author response.
Clarity and lack of details were a major issue, and it seems that the authors will amend this. Most of my concerns in this regard were about missing details, and therefore I am satisfied with the proposed edits.
Statistics: the authors state that they checked four networks but do not plan to do any statistics. I think statistics would greatly strengthen the paper, and I urge the authors to include them in the final version.
I increase my score from 7 to 8.
-------------------------------------------------------
A few additional comments:
1. Panel labels (A,B,C) are missing from all figures.
2. What is Tmax?
3. Why the choice of the small waiting penalty? How does it affect behavior?
4. L91: fully specified unless there is timeout.
5. 3.1.1. the accuracy of the Bayesian agent is affected by the threshold. What is the threshold? How does it affect accuracy? How was it chosen?
6. Fig 2A: Marking chance level could help.
7. Fig 2C: It’s quite hard to see what’s going on there.
8. Figure 3B: Perhaps it would be better to only show the zero stimulus case.
9. Figure 3C: not described in the text.
10. L117: “additionally…” Not entirely clear what this sentence adds on top of the previous one.
11. L121 “approaches” in what sense? Is there some limit?
12. Figure 4A: The dots are clearly separated with a large margin. Why isn’t performance 100%?
13. Figure 4A: The description is very partial. What are the points? Were they taken from a specific timepoint in the trial? Which?
14. Figure 4B,C: A colorbar could help. Are there fewer trials for the right block?
15. L146: showing a time point. Which time point?
16. Figure 5B: very light colors. Hard to see.
17. Figure 5C: Why the sudden increase at the last trials?
18. Figure 7B: It’s hard to see the pattern. Perhaps add distributions of values for each quadrant.
19. Figure 7C: Color is not explained. I assume this is the norm of the flow of dynamics, similar to the measure in Sussillo & Barak (2013), but this is not explained well.
20. L217: The “Delta timestep decoherence”. I’m not sure this is a standard term. If this is where it is being defined, the sentence doesn’t make it clear.
21. Figure 8B: Hard to see because curves are on top of each other.

__ Summary and Contributions__: The paper extracts the mechanism by which an RNN implements a complex task involving accumulation of evidence on two different timescales. Behavioral analyses show that the RNN approaches optimal Bayesian behavior. Dimensionality reduction analyses point to a dynamical mechanism reminiscent of line attractors, with stimulus posterior and block posterior being encoded along 2 different directions. Finally, the authors apply a novel model compression method to further clarify the mechanism. Overall, the paper introduces a very promising method for reverse-engineering RNNs.

__ Strengths__: - exhaustive theoretical analysis and interpretation of a trained RNN
- focus on a highly relevant neuroscience task, for which large amounts of data will be collected.
- beyond this specific task, the method for interpreting RNNs is highly relevant to the NeurIPS community

__ Weaknesses__: - the most interesting part of the paper (analysis of the distilled network) is highly compressed. The corresponding sub-sections are hard to follow. Supplementary information would have been very useful.
- it would have been nice to link the final distilled model to Bayesian aspects of the computations. The reader is left wondering how Bayesian integration emerges from the RNN.
- it would have been interesting to link the distilled network parameters to the original network's dynamics (timescales of integration and decay) and connectivity (eigenvalues...).

__ Correctness__: yes

__ Clarity__: Overall clear, but some details are missing.
Is the beginning of the trial indicated in any way to the network? Presumably there is no reset (otherwise blocks would not be detected), but then does the action on the previous trial bias the action on the next trial beyond what is expected based on block structure?
L194: "representation of the block side must be continuous..." - I found the argument difficult to follow
Figure text is very small. Labels for panels are missing throughout.
Fig 3a: How is concordant/discordant defined for a zero contrast stimulus?
Fig 4a: what does every point represent? What are lines and vectors?
Fig 4b,c are hard to read
Fig 7a: this is the Pearson correlation of which quantity?
Fig 7b: the E-I population structure mentioned in the text is not obvious in the figure. Since it is anyway not used for the mechanism, I would remove this panel.
Fig 8a: not clear what is plotted here
Fig 8b: "trajectories closely match" -> very hard to see this in the figure
Fig 8c: same question as in Fig 3a
Several of the references are incomplete (journal, year ... missing).

__ Relation to Prior Work__: Clearly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper studies the mechanism of an RNN model trained on a block-structured hierarchical inference task for rodents that's centered around inferring two correlated task variables: block side and stimulus side (requiring inference timescales separated by some orders of magnitude).
The RNN is compared to two (nearly identical) Bayesian baselines that serve as upper performance bounds. Both baselines have two modules that perform posterior inference for the two variables (stimulus side, block side) with full task knowledge, with only one allowed to make an action and move on to subsequent trials.
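For readers unfamiliar with such baselines, the within-trial part of this posterior inference can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes Gaussian observations with mean `+mu` (right) or `-mu` (left) and standard deviation `sigma`, and that the block module supplies the prior `prior_right`. The function name and all parameters are hypothetical.

```python
import math

def stimulus_posterior(observations, prior_right=0.5, mu=1.0, sigma=1.0):
    """Return P(stimulus side = right | observations so far) after each observation.

    Assumption (not from the paper): each observation is drawn from a Gaussian
    with mean +mu for a right stimulus and -mu for a left stimulus, with shared
    standard deviation sigma. prior_right stands in for the block-side belief.
    """
    # Work in log-odds so that evidence accumulates additively.
    log_odds = math.log(prior_right / (1.0 - prior_right))
    posteriors = []
    for x in observations:
        # Log-likelihood ratio of two equal-variance Gaussians centred at +mu and -mu.
        log_odds += 2.0 * mu * x / sigma ** 2
        posteriors.append(1.0 / (1.0 + math.exp(-log_odds)))
    return posteriors

if __name__ == "__main__":
    # A right-biased block prior followed by weak rightward evidence.
    print(stimulus_posterior([0.2, 0.1, 0.3], prior_right=0.8))
```

In this form the block belief enters only as the starting point of the accumulator, which is one simple way the two timescales could interact; the baselines in the paper may couple them differently.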
Contributions (and/or findings):
1) RNN model performance comes close to the Bayesian upper bound
2) RNN infers the block structure of the trials, and not just the stimulus
3) RNN dynamics collapse to a 2D subspace, governed mostly by the two variables
4) RNN accumulates evidence to infer both variables
5) RNN consists of two anti-correlated populations (encoding left and right) that likely self-sustain their own dynamics
6) Authors introduce a distillation method that forces student RNN states to be a low-dimensional projection of the teacher RNN's states
7) A 2-unit distilled RNN learns the same variables as the teacher RNN, though a 2-unit RNN that's trained from scratch doesn't learn well
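To make the distillation item above concrete, here is a minimal sketch of a hidden-state matching loss. It is an assumed form for illustration only: the paper's exact objective may differ, and the names `project`, `state_distillation_loss`, and the fixed `projection` matrix are hypothetical.

```python
def project(matrix, vector):
    """Multiply a (k x n) matrix, given as a list of rows, by a length-n vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def state_distillation_loss(student_states, teacher_states, projection):
    """Mean squared error between student states and projected teacher states.

    student_states: list of length-k hidden states, one per time step.
    teacher_states: list of length-n hidden states, one per time step.
    projection:     k x n matrix mapping teacher states into the student's space
                    (assumed fixed here, e.g. from a dimensionality reduction).
    """
    total, count = 0.0, 0
    for h_student, h_teacher in zip(student_states, teacher_states):
        target = project(projection, h_teacher)
        for s, t in zip(h_student, target):
            total += (s - t) ** 2
            count += 1
    return total / count

if __name__ == "__main__":
    teacher = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]   # 3-unit teacher, 2 time steps
    projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # keep the first two coordinates
    student = [[1.0, 2.0], [0.0, 0.0]]              # 2-unit student
    print(state_distillation_loss(student, teacher, projection))
```

Minimizing such a loss pushes the student's two units to track the teacher's low-dimensional trajectory, which is the property the findings above attribute to the distilled network.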

__ Strengths__: - Empirical evaluation is very satisfactory, the claims are quite easy to agree with and are based on the right experiments.
- The paper draws similarities between standard RNNs and Bayes-optimal actors, offers a thorough slicing of RNN mechanics along multiple dimensions that allows insights into their behavior, and communicates basic tools that could be of use to the broader community in investigating neural networks. Along these axes, it's a worthwhile contribution to the community.

__ Weaknesses__: The proposed distillation objective is lacking some context. It is acknowledged that the proposed scheme is similar to the one employed in TinyBERT; on the other hand, there are other distillation schemes for RNNs that could have been used as a baseline, such as sequence-level distillation (https://arxiv.org/pdf/1606.07947.pdf). If a method is posed as a contribution, it's important to visit existing techniques before introducing a new one.

__ Correctness__: Empirical methodology is sound, with the experiments more or less straightforwardly leading to the stated conclusions.

__ Clarity__: The paper is very well written, though some missing pieces are likely to leave most readers with questions. For instance, what data was used to train the RNNs with the CE loss? Is it LM-style teacher forcing (with a predefined sequence shifted between input and output), or something else? This seems like an important detail left out. Since the codebase stated in the paper does not yet exist, there doesn't seem to be a clear way for the reader to get an answer.

__ Relation to Prior Work__: There isn't much discussion of prior work, or of work from different fields that moves in similar directions (analogies between RNNs and optimal decoders, other works that use artificial networks to model tasks like these, etc.), and this is a shortcoming.

__ Reproducibility__: No

__ Additional Feedback__: I'd have liked to see the codebase (reproducibility is highly desired), the details on RNN training, relevant work (doesn't need to be competing work, additional context is always very useful), and a distillation baseline if the distillation method itself is considered a contribution. I firmly believe there is great value in disseminating this work, and I'd surely consider adjusting my score upon provision of the above information.
Additionally, the distillation objective introduced surely has its broader impacts, which should ideally be addressed.
/////////////////////////
Update after rebuttal
/////////////////////////
I'm satisfied with the rebuttal. I'm raising my score since the authors agreed to provide the requested information and also made their code public.
I also advise revising the broader impact section, adding discussion around the introduced distillation technique and its impacts.