Review for NeurIPS paper: Characterizing emergent representations in a space of candidate learning rules for deep networks

NeurIPS 2020

Characterizing emergent representations in a space of candidate learning rules for deep networks

Review 1

Summary and Contributions: This paper proposes a two-dimensional space of learning rules governed by two parameters such that different points in the space correspond to gradient descent, contrastive Hebbian learning, a variant of predictive coding, Hebbian learning, and anti-Hebbian learning. The authors then analyze the learning dynamics of these rules in a linear network with one hidden layer, after the fashion of previous work by Andrew Saxe and colleagues. They train the networks to learn the standard hierarchy of concepts used in much previous work: oak, pine, rose, daisy, etc. They then characterize these networks for their ability to model progressive differentiation, stage-like learning, and illusory generalizations, all characteristics found in human development. While most of the algorithms lead to zero error on the training set, only a smaller subset satisfies these criteria, and those rules are generally close to gradient descent. In a second experiment, they more finely differentiate the algorithms along the two axes corresponding to top-down feedback (positive or negative) and anti-Hebbian to Hebbian learning. Generally, it appears that only weak Hebbian learning can show the correct behavior dynamics in terms of progressive differentiation, while a relatively wider range along the top-down feedback axis can. Furthermore, not all of these algorithms show the stage-like learning. After having read the authors' response, I am pleased to maintain my assessment of this paper. I always like more figures (;-)), and I believe the authors have done a good job of addressing the negative reviewers' concerns.

Strengths: 1. The parametric model of learning rules is a nice contribution. 2. The characterization of the space of algorithms that satisfy the three criteria is cool. 3. They come up with relatively formal measures of things like stage-like learning.

Weaknesses: 1. I think it is time to retire this dataset in favor of something more realistic that could be more specific about human learning. 2. The model seems overly simple. This is both a feature and a bug. 3. The characterizations are not as systematic as one would like. For example, while all models are characterized in terms of their learning dynamics and final representational structure, only two are displayed in Figure 2 for their stage-like behavior. If this is in the supplementary material, ok, but the paper should be relatively self-contained. 4. The notation is very confusing.

Correctness: Yes.

Clarity: Well, not in terms of the notation. It’s clear in equations 1 and 2, but then in the paragraph starting on line 76, all hell breaks loose. There are some W’s that would seem to need another subscript, and the notation is not consistent between W_h and W_{i_c}. How about using superscripts on the deltas, capitalized for the learning rule, as in \Delta^H_{W_1}(\eta) for Hebbian and \Delta^C_{W_2}(\gamma) for contrastive learning, or some such? Or use superscripts on the weight-change W’s. In any case, I found it very confusing. A question: in line 185, you say that stage-like learning doesn’t happen in shallow networks, but then you get it with your very shallow network. Why? Line 204: The pairwise distances don’t have to be identical, but they should be correlated. Distances will vary with the dimensions of the representation. Line 244: rather local -> rather than local
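To illustrate the line 204 comment, here is a minimal sketch of comparing two representations by the correlation of their pairwise distances rather than by the distances themselves. The toy item vectors and the helper names `pairwise_distances` and `pearson` are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import dist, sqrt


def pairwise_distances(reps):
    """Euclidean distance for every unordered pair of item representations."""
    return [dist(a, b) for a, b in combinations(reps, 2)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy representations of four items in a 2-unit layer.
reps_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
# Same geometry, rescaled by 3 and embedded in a 3-unit layer.
reps_b = [(3 * x, 3 * y, 0.0) for x, y in reps_a]

d_a = pairwise_distances(reps_a)
d_b = pairwise_distances(reps_b)
r = pearson(d_a, d_b)  # distances differ in scale, yet correlate perfectly
```

In this toy case the raw distances disagree by a factor of 3, so a test for identical distances would fail, while the correlation stays at (numerically) 1.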

Relation to Prior Work: Yes. However, it seems odd to have a paper about mixed learning rules for cortex without referencing Randy O’Reilly’s work (in particular, the O’Reilly & Munakata book). He found that a small Hebbian contribution in addition to CHL really cleaned up the representations his network learned, and made them generalize better than vanilla backprop.

Reproducibility: Yes

Additional Feedback: It would help for reproducibility to promise to share your code.

Review 2

Summary and Contributions: The authors evaluated many synaptic plasticity rules on a supervised learning task used to model semantic development in human children. In particular, the plasticity rules live in a two-dimensional plasticity-rule space that contains several classical plasticity rules (or particular implementations of these rules), including Contrastive Hebbian, gradient-descent, Hebbian, and anti-Hebbian rules. The authors showed that several features exhibited by human children during development are also present in networks trained with gradient-descent-like learning rules, but not with many other rules. The behavioral features studied include stage-like transitions, illusory correlation, and overgeneralization during few-shot learning.

Strengths: The authors proposed an interesting way to parameterize plasticity rules such that in a two-dimensional space, many classical rules can be visualized. The use of a two-dimensional space makes visualization particularly easy. The authors went beyond looking at the performance of these rules in the regression task, and studied several behavioral properties previously identified as interesting and non-trivial for neural networks.

Weaknesses: I have some concerns about the technical correctness of the results shown, see below. Perhaps the authors could clarify them, and I will be willing to change my score accordingly. Assuming the results are all technically correct, then the main concern is that this work is rather similar to Saxe, McClelland, Ganguli 2019 PNAS, with almost the exact same dataset and analyses. They even used the same “worms have bones” example for illusory correlations. The authors of course extended the previous analyses to more learning rules. ------------------------- Update after rebuttal: The authors largely addressed my technical concerns. I still think their measure of progressive differentiation (Fig. 2a) could be improved, but I think it won't impact their main results. I have changed my overall score from 4 to 5.

Correctness: (1) In Fig. 2a, the authors quantified the “mean time lag between learning adjacent hierarchy levels” and used that as the measure of progressive differentiation of different hierarchy levels. However, when the absolute value of $\gamma$ increases, the effective learning rate likely increases, and the overall learning time for all hierarchy levels may decrease. So to measure progressive differentiation, it is probably more appropriate to normalize the mean time lag by the time to learn the last hierarchy level (or something similar). (2) In Fig. 4, I don’t understand how the network could possibly reduce training error, let alone to zero, when $\gamma = 0$. If I understand correctly, when $\gamma = 0$, there is no information about the target y in the network (see Fig. 1 and section 2), so how could the network possibly learn the target?
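The normalization suggested in point (1) could be sketched roughly as follows. This is only an illustration under assumed inputs: `level_times` is a hypothetical list of the training times at which each hierarchy level is learned, ordered from the coarsest to the finest level, and neither function name comes from the paper.

```python
def mean_time_lag(level_times):
    """Mean gap between the learning times of adjacent hierarchy levels."""
    lags = [later - earlier
            for earlier, later in zip(level_times, level_times[1:])]
    return sum(lags) / len(lags)


def normalized_time_lag(level_times):
    """Mean lag divided by the time needed to learn the last level."""
    return mean_time_lag(level_times) / level_times[-1]


# Hypothetical learning times for two rules; the second is simply
# 10x faster overall (e.g. a larger effective learning rate).
slow_rule = [100, 200, 400]
fast_rule = [10, 20, 40]

raw_slow, raw_fast = mean_time_lag(slow_rule), mean_time_lag(fast_rule)
norm_slow = normalized_time_lag(slow_rule)
norm_fast = normalized_time_lag(fast_rule)
```

With these made-up numbers the raw lags differ by a factor of 10 although the relative staging of the levels is the same, whereas the normalized lags coincide.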

Clarity: The paper is overall well-written and understandable. In line 185, the authors, perhaps following the example of ref 38, used “deep networks” to refer to neural networks with at least a single hidden layer. I think “deep networks” usually describes networks with more than one hidden layer (at least in the context of feedforward networks). So this usage of deep vs shallow networks could be confusing, and I would prefer that the authors use more precise language here.

Relation to Prior Work: Overall, the paper is well-referenced. In line 100, the authors mentioned that Contrastive Hebbian learning (CHL) “has been widely used as a biologically plausible alternative to gradient descent”. CHL still relies on symmetric feedback weights and two-phase learning, two properties that are not clearly biologically plausible. I would recommend that the authors tone down the claim about the biological plausibility of CHL here.

Reproducibility: Yes

Additional Feedback: None

Review 3

Summary and Contributions: This paper studies a family of learning rules in the space of top-down feedback and Hebbian learning, covering five benchmark learning algorithms (gradient descent, contrastive Hebbian, quasi-predictive coding, Hebbian and anti-Hebbian) as specific points. The paper is particularly interested in the extent to which the behaviors of the algorithms resemble human semantic development such as progressive differentiation and illusory correlations, and the extent of task-relevant representations relative to unsupervised representations.

Strengths: The unification of the benchmark algorithms in the space of top-down feedback and Hebbian learning is a fresh way of studying the learning algorithms.

Weaknesses: The exploration of learning rules with only two variable parameters is too restrictive. Numerous learning rules have already been proposed in the field, but they do not fit into the proposed scheme. Even with two parameters characterizing feedback and Hebbian learning, a single additive mixture is only one possibility. For example, a recent development is the study of tri-factor learning rules [1-3]. There are also learning rules that deal with special cases such as the hierarchical patterns that this paper studies; a family of (static) learning rules was proposed many years ago [4]. Recurrent networks are another class of models worthy of attention. Surely it is neither necessary nor practical for a conference paper to deal with too many learning rules. Rather, citing the above examples, the issue more relevant to the reader is: given an arbitrary learning rule, how will one be able to characterize it in terms of the attributes the authors are interested in (feedback, Hebbian-like, or human-like)? Another point concerns the significance of the result in Figure 6, which compares the integrated weight change during learning due to the contrastive Hebbian and Hebbian components. Since the amplitudes of the weight changes are prescribed by the parameters $\gamma$ and $\eta$, it is not clear what extra information one can obtain from Figure 6. The paper also mentions the prospect that the work will “help guide future experiments” (line 53), but the reader cannot find hints on how this can be achieved.

[1] Nicolas Fremaux and Wulfram Gerstner. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Frontiers in Neural Circuits, 9:85 (2016).
[2] Lukasz Kusmierz, Takuya Isomura, and Taro Toyoizumi. Learning with three factors: modulating Hebbian plasticity with errors. Current Opinion in Neurobiology, 46:170-177 (2017).
[3] Wulfram Gerstner, Marco Lehmann, Vasiliki Liakoni, Dane Corneil, and Johanni Brea. Eligibility traces and plasticity on behavioral time scales: experimental support of neohebbian three-factor learning rules. Frontiers in Neural Circuits, 12:53 (2018).
[4] N. Parga and M. A. Virasoro. The ultrametric organization of memories in a neural network. Journal de Physique, 47(11):1857-1864 (1986), and many other papers citing it.

Update after feedback and discussions: I have no complaint about the amount and quality of work done by the authors, but would like the authors to think more about the deeper issues implied by their work. Among my many comments on this paper, the most important concern is whether the family of learning rules is too restrictive or not. While the 2D space is able to unify 5 common learning rules, the mixing of the rules is already a bit ad hoc, and the scheme precludes the inclusion of other learning paradigms. One may further query the value of mixing, and conjecture that the comparative study may reach similar conclusions if the behaviors of the learning rules are studied individually. If, at the starting point of the study, attention were paid to which attributes of network behavior the authors are interested in, and those attributes were used to construct a 2D (or higher-dimensional) space (rather than using two parameters of specific learning rules), and that space were in turn used to map the different learning rules, the impact of their work would be more far-reaching. With this consideration, my score remains at 4.

Correctness: Overall, the claims and methods are correct, but there are two points of caution. The paper claims to be investigating deep networks, but in fact it is studying networks with a single hidden layer. While some observations can be extended to subsequent layers in deep networks, one has to be especially cautious when dealing with issues such as stage-like transitions, as it cannot be excluded that transitions may take place in subsequent layers even when they do not occur in earlier layers. Another point concerns the interpretation of illusory correlations in human learning. The present observation of illusory trajectories seems to be based on learning instances with opposite-signed coefficients of the singular vectors in the singular value decomposition. This is possible in the present model because there is an inversion symmetry of the singular vectors (that is, after multiplying the singular vectors by -1, they remain singular vectors). The issue is whether this inversion symmetry is also present in human learning experience or is an artifact of artificial models.
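The inversion symmetry mentioned above can be written out as a short worked equation. The symbols here are assumptions introduced for illustration and may not match the paper's notation: $\Sigma$ stands for the matrix being decomposed (for example, an input-output correlation matrix), with singular values $s_\alpha$ and left and right singular vectors $u_\alpha$ and $v_\alpha$.

```latex
\Sigma \;=\; \sum_{\alpha} s_\alpha \, u_\alpha v_\alpha^{\top},
\qquad
s_\alpha \, (-u_\alpha)(-v_\alpha)^{\top} \;=\; s_\alpha \, u_\alpha v_\alpha^{\top}.
```

Flipping the sign of a left singular vector together with its right partner therefore leaves every term of the decomposition unchanged, so the overall sign of each mode is a convention of the decomposition rather than something fixed by the data.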

Clarity: Overall, the paper is clearly written, but there are a few points for clarification. Figure 1a: camped -> clamped. Line 185: The sentence “We consider an individual feature $m$ for item $i$, …”. It took a while to realize that the sentence refers to a particular learning instance, not all learning instances of the feature-item pair. Line 244: rather -> rather than

Relation to Prior Work: The paper starts by discussing learning rules for task-relevant representations and for unsupervised learning, and expressed the need for a link between them. This may be considered as a distinctive feature of this paper.

Reproducibility: Yes

Additional Feedback: How will adding repulsive couplings among the hidden units modify the learning behavior?

Review 4

Summary and Contributions: This paper extends previous results on semantic learning of hierarchical categories in deep networks to other learning rules beyond gradient descent. The paper defines a two-dimensional family of learning rules that encompasses gradient descent, contrastive Hebbian learning, a Predictive Coding-style learning rule, and Hebbian and Anti-Hebbian learning rules, and examines the learning dynamics across different members of the family. The paper finds substantial differences in progressive differentiation, inductive generalization and illusory correlations. Analyzing the number of learning stages, the authors find that the learning dynamics consistent with human data are broadly centered around gradient descent-like learning rules.

Strengths: I think that the questions this paper addresses are very well motivated, the methods are convincing, and the paper is overall well-structured and demonstrates novel work. The question of which learning rule is implemented in the brain is a very important (and difficult!) one, and while this paper only makes a small step towards answering this question, it does so in a very well argued and methodical way.

Weaknesses: - One limitation of the present work is that it considers a very simple hierarchical task whose categories are perfectly linearly separable using the given features and a simple two-layer linear network. It remains to be seen whether the lessons obtained here will generalize to harder tasks and non-linear networks. - One particular concern is that in this setup, as shown in Figure 4a, a strong Hebbian learning rule leads to much faster convergence than gradient descent, presumably because Hebbian learning quickly memorizes the training set. It might very well be that Hebbian learning leads to more progressive differentiation in tasks where it results in equal or slower convergence than gradient descent.

Correctness: Yes, as far as I can see.

Clarity: Yes, very clear.

Relation to Prior Work: To my knowledge, yes.

Reproducibility: Yes

Additional Feedback: --- After rebuttal --- I'm satisfied with the authors' response to my review. I raised my score from 7 to 8.