__ Summary and Contributions__: This paper proposes a stochastic recurrent neural network that builds up its local information representation through a learning rule based on Boltzmann machines, but weighted by a task-dependent objective function, forming a so-called tri-factor learning rule. The results show how the network depends on the task (regression or classification) in terms of the distribution of the tuning curves, population-averaged activities, and dependence on stimulus priors. The paper then considers how noise is redistributed in the neural manifold so that task performance can be achieved.

__ Strengths__: This paper has several strong points. The first is its thoroughness in elucidating the properties of the learning rule it introduces. Beyond merely showing that the rule works, the paper discusses how network behaviors such as the distribution of the tuning curves depend on factors such as the choice of task and the extent of congruence between the most probable input and the decision boundary. The paper is comprehensive in that it attends to behavior at the neuronal level as well as the population level.
Another strong point is the discussion on the tuning curves of the neurons. This enables the comparison of its predictions and related models with neuroscience experimental results.
The discussion of the effects of prior shifts and discriminability is also relevant to the neuroscience community.
One further contribution of the paper is the study of the effects of learning on noise redistribution through its introduction of the notion of noise volume fraction. This concept and technique can be extended to the study of other models.

__ Weaknesses__: Despite the theoretical insights, many details were not written clearly. Please see comments in the section on clarity.
Update after feedback and discussions:
=============================
The main contribution of this paper is the insightful elucidation of the learning behaviors of the tri-factor learning rules.
The authors’ response on the symmetric weights is less satisfactory. If the weights are not symmetric, it is not possible to write an energy function to be minimized in the first place, and an alternative formulation is needed.
I accept the authors’ clarification of the volume fraction. On the other hand, the argument that noise was introduced to derive the closed-form expressions and to avoid trapped learning was in fact already well known. The original sentence might have misled the reader into expecting deeper principles not intended by the authors. Related to the issue of noise in the authors’ response to Reviewer 3, I agree that more effort should be made to consider how stochasticity encodes uncertainty, rather than treating it merely as a tool to derive convenient algorithms. For example, variances of neural responses play a role in encoding uncertainty in probabilistic population coding.

__ Correctness__: The claims are fine, but restricted to the case of symmetric connectivity matrices, for which the energy function can be written and gradient descent can be derived. The authors may like to discuss how the situation would be modified for asymmetric couplings.
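The restriction to symmetric couplings can be made explicit with the textbook quadratic energy of a Boltzmann-type network. The form below is an assumption for illustration: the paper's own energy may include bias and noise terms that are omitted here, and the symbols $E$, $W$, $S$, and $\mathbf{r}$ are used generically.

```latex
\begin{aligned}
E(\mathbf{r}) &= -\tfrac{1}{2}\,\mathbf{r}^{\top} W \mathbf{r}
              = -\tfrac{1}{2}\,\mathbf{r}^{\top} S\, \mathbf{r},
\qquad S = \tfrac{1}{2}\left(W + W^{\top}\right), \\
-\frac{\partial E}{\partial r_i} &= \sum_j S_{ij}\, r_j .
\end{aligned}
```

Because $\mathbf{r}^{\top} W \mathbf{r} = \mathbf{r}^{\top} W^{\top} \mathbf{r}$, only the symmetric part $S$ enters the energy. A drive of the form $\sum_j W_{ij} r_j$ is therefore the negative gradient of $E$ only when $W = W^{\top}$; the antisymmetric part of $W$ has no counterpart in any such energy.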

__ Clarity__: Surely there is room for improvement. Below are some examples:
Fig. 1 caption: “Kernel density estimates” is not explained.
Line 43: The fixed tuning $s_j(\theta)$ is not explained. This is an important point since the tuning profile will affect the output tuning curves.
Eq. (8): A minus sign is missing.
Line 110: Where is the symbol $\tau_s$ defined?
Line 112: Is the symbol $\sigma_B$ the same as $\sigma$?
Line 126: It was asserted that the contribution of improbable stimuli to the emerging representation is small. The phrase “improbable stimuli” is ambiguous, since it may refer either to $\theta = \pi/2$, where the frequency in Fig. 1d is highest, or to the extreme values $\theta = \pm \pi$. The reader will be confused if he/she has the former case in mind.
Fig. 2a: An explanation of this figure is missing. For example, what do the shapes and colors of the ellipses represent, and what do the pins connecting the ellipses and the crosses represent?
Fig. 2d: Similarly, an explanation of this figure is missing. For example, what do the patches represent? What do their shapes and colors represent?
Lines 132-144: In the experiments with changed priors, it is not clear to the reader whether the priors are changed during learning, only during testing, or whether the simulation is performed with online learning and simultaneous online testing.
Line 160 and Fig. 2: The symbol $d’$ is not explained.
Line 180: The conditional stimulus was mentioned, but it is not clear what the stimulus is conditioned on.
Line 185: The noise volume fraction was introduced, but no mathematical formulation was presented in either the main text or Supplementary Material. If this is an important conclusion of the paper, the reader deserves to know more details.
Line 213: It is not clear in what way “intrinsic noise in the recurrent dynamics ... allows us to derive closed-form probabilistic expressions for the objective function gradients.” If it refers to the factor $1/\sigma^2$ in Eq. (8) which was discarded in Eq. (9), the reasoning is not strong enough as this is merely an issue of scaling.
Ref [32]: The volume and pages are missing.

__ Relation to Prior Work__: A few references on tri-factor learning were mentioned and it was pointed out that many details remained unexplored. The authors may also like to note the following reference:
Lukasz Kusmierz, Takuya Isomura and Taro Toyoizumi, Learning with three factors: modulating Hebbian plasticity with errors, Current Opinion in Neurobiology 46:170-177 (2017).

__ Reproducibility__: No

__ Additional Feedback__: Would the authors like to consider more complex tasks as the next step?

__ Summary and Contributions__: In this paper the authors derive an error modulated hebbian learning rule and show how it can be used in a recurrent neural network to learn an estimation or classification task. They then investigate how the learned representations vary across these two tasks.

__ Strengths__: The authors' investigation of the difference in neural tuning, and how these are shaped by the different priors and loss functions between tasks, was thorough.

__ Weaknesses__: It's unclear what the major contributions of the current paper are when compared to the cited literature. Specifically, the derivation of the learning rules (eqs 4-9) results in equations nearly identical to those in the previous literature (e.g., eq 7 of citation 34, eq B3 of citation 25, eq 16 of Legenstein, Chase, Schwartz, Maass 2010), once the conditional expectations are replaced with time-averaged values. The introduction of two different readout terms (eqs 5 & 6) represents only incremental progress, adding the ability to perform two different types of tasks.

__ Correctness__: The methodology and interpretation of results seem correct.

__ Clarity__: The paper is written clearly, and the presentation doesn't inhibit comprehension.

__ Relation to Prior Work__: It's unclear how the proposed learning rule is different from the cited works.

__ Reproducibility__: Yes

__ Additional Feedback__: [Following author feedback]: It's still unclear to me what the major novelty of the derived learning rule is. As the authors point out in their rebuttal, the learning rule takes the form of a modulation term multiplied by the difference between short-term and long-term pre-post correlations. This is the form taken in standard Contrastive Hebbian Learning, with an additional modulating term.
However, I will defer to my fellow reviewers in the novelty and contributions of having derived the learning rule in a principled manner.
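The structure described above (a modulation term multiplying the difference between short-term and long-term pre-post correlations) can be sketched as follows. This is a minimal illustration, not the authors' rule: the function name, the running-average estimate of the long-term correlation, and the parameters `eta` and `decay` are all assumptions made for the example.

```python
import numpy as np

def modulated_contrastive_step(W, r, long_corr, modulation, eta=0.01, decay=0.99):
    """One update of a modulated contrastive-Hebbian-style rule (illustrative only).

    W          : (n, n) recurrent weights
    r          : (n,) current population activity
    long_corr  : (n, n) slowly varying estimate of the pre-post correlation
    modulation : scalar task-dependent factor (the third factor)
    """
    # Short-term pre-post correlation from the current activity sample.
    short_corr = np.outer(r, r)
    # Third factor scales the contrast between short- and long-term correlations.
    dW = eta * modulation * (short_corr - long_corr)
    # Assumed choice: track the long-term correlation with an exponential average.
    new_long_corr = decay * long_corr + (1.0 - decay) * short_corr
    return W + dW, new_long_corr
```

Setting `modulation` to a constant recovers a plain contrastive Hebbian update in this sketch, which is the sense in which the modulating term is the only addition.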

__ Summary and Contributions__: This study uses a normative approach to derive a learning rule in a network model that takes task information into account. The derived learning rule resembles the reward-modulated Hebbian rule, as claimed by the authors. To illustrate the model, the authors use it to solve two tasks: estimating a continuous variable and estimating a categorical variable.

__ Strengths__: The mathematical analysis in this work is solid, and the simulated experiments support their claims.

__ Weaknesses__: I have a couple of major concerns at the conceptual level. I would appreciate it if the authors could add related discussion in a revised version (there is still enough space in the Discussion) or briefly answer them in the author feedback.
1. I am happy to see results showing how different tasks influence the representation in a recurrent network. However, I wonder about the possibility of a disentangled representation of sensory information and task information. That is, the learned recurrent weight W depends only on the sensory information (likelihood and prior) and not on the task information, while the task information is represented solely in the readout matrix D. An advantage of this representation is that the brain does not need two sets of recurrent weights W to implement two different tasks; it only needs to choose how to read out the sensory information represented in the network.
Update after rebuttal: the authors cite two papers that support a mixed representation of sensory and task information. I suggest the authors add a brief discussion of this in the revised manuscript. This could also enhance the biological relevance of this work.
2. Although the results of the current study imply that there is a conflict between the roles of stochasticity in learning and in encoding (line 225), I hesitate to accept this claim in general. It is probably not true in a Bayesian framework where the posterior distribution is represented by sampling-based codes. In that framework, the internal stochasticity is essential for encoding the posterior distribution rather than harmful to encoding.
Line 225: conflict between the positive role of stochasticity in learning and its deleterious effects on encoding. This conflict may not hold if one considers a sampling-based representation of the posterior distribution in a Bayesian framework.
In the MSE (point estimate) framework, the network merely acts as a filter of the noise.

__ Correctness__: I have gone through all math equations and I believe the math derivations are correct although some typos exist (see the comments below). The numerical experiments support the conclusions.

__ Clarity__: Overall the paper is well written and I could get the main information quickly.
Some suggestions on the structure and typos:
1. Although the authors mention that the stochastic network is performing sampling, I was puzzled as to why they consider only objective functions based on point estimates (Eqs. 5-6) rather than the whole posterior distribution. My confusion was not resolved until I saw the footnote on page 7. I strongly suggest the authors mention this somewhere earlier, e.g., when they mention sampling and objective functions on pages 3 and 4.
2. Line 40: I think the title “local learning” is somewhat over-claimed, because the learning rule also depends on a global task-specific loss function. Although I do accept the derived learning rule and its biological plausibility, I suggest the authors revise this title.
3. Eq. 6: I think the objective function is the cross entropy, if I understood correctly. In that case the term inside the second log function should be $1 - \psi$.
4. For the equation right after line 90: is a factor of $1/\sigma^2$ missing from the first term on the right-hand side?
5. Eq. 9: comparing with the equation right after line 90, I think you cannot simply drop the expectation when approximating it by sampling. A standard notation replaces the expectation by an empirical sum over samples (see Eq. 11.2 in Bishop's 2006 book for an example). For example, you could write Eq. 9 as the empirical average over neuronal responses r indexed by time t.
6. Line 220: what is the full name of HVC? Is it a brain area or a nucleus in a songbird? Even if it is a common and standard abbreviation in songbird study, briefly mentioning what this means is quite helpful for readers.
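Items 3 and 5 above can be illustrated with a short sketch. It assumes a binary label `y` and a readout probability `psi`, and writes the objective as a Bernoulli log-likelihood to be maximized (the negative of the cross-entropy); the function names and the sample-based estimator are hypothetical and are not taken from the paper.

```python
import numpy as np

def bernoulli_log_likelihood(psi, y, eps=1e-12):
    """Log-likelihood of a binary label y under readout probability psi.

    The second log takes 1 - psi, as suggested in item 3.
    """
    psi = np.clip(psi, eps, 1.0 - eps)  # avoid log(0)
    return y * np.log(psi) + (1.0 - y) * np.log(1.0 - psi)

def empirical_expectation(f, samples):
    """Approximate E[f(r)] by (1/T) * sum_t f(r_t) over sampled responses r_t (item 5)."""
    return np.mean([f(r_t) for r_t in samples], axis=0)
```

Writing the estimate as an explicit average over the time index makes clear that Eq. 9 is a sample-based approximation of the expectation rather than an exact identity.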

__ Relation to Prior Work__: The author discussed related work on stochasticity, learning rules, etc.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors detail a framework for stochastic gradient estimation of task-based objective functions. They demonstrate its application in a noisy recurrent neural network design in which the stochastic gradient updates can be expressed as three-factor Hebbian update rules for the parameters. Experiments show that the authors’ network learns to “allocate” noise in a way that helps it maximize task performance and represent common stimuli well.

__ Strengths__: The authors clearly situate their work within the realm of bioplausible gradient-estimation approaches, and give a very clear exposition. I know this will sound cliche, but after reading their paper I think, “aha, if I had read the background material, I could have come up with this!” While I am partial to a Bayesian approach, their derivations make the paper worth reading before I reach the numerical results.

__ Weaknesses__: The authors could have used an experimental design that would be easier to scale up and evaluate from a machine learning perspective, or from the perspective of broader perceptual neuroscience. Real perception involves multiple receptors, modalities, etc. On the other hand, datasets with easy-to-evaluate probability distributions for these kinds of stimuli are harder to come by.
I would also like to have seen the authors discuss a broader class of neural network designs, ideally separating the network architecture itself from the Hebbian gradient ascent algorithm. After all, we do not know what sort of network architecture the brain itself uses, so an algorithm or update rule derivation that can apply to arbitrary architectures would seem, at least to me, to be more biologically plausible than one that constrains us to fully connected recurrent networks.

__ Correctness__: I cannot find any clear mistakes in their application of the REINFORCE gradient estimation trick, nor in the plots of their numerical results.

__ Clarity__: The paper is very clearly written, which is one of its strengths. The largest correction I would ask for is that the authors refer uniformly to “gradient ascent” throughout the paper, since they apply their gradient-estimation framework to objective maximization rather than to the minimization by gradient descent that is usual in much of machine learning.

__ Relation to Prior Work__: While the authors situate their work relative to prior work in the Discussion section instead of earlier, they do so very clearly.

__ Reproducibility__: Yes

__ Additional Feedback__: