NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3824
Title:Updates of Equilibrium Prop Match Gradients of Backprop Through Time in an RNN with Static Input

Reviewer 1


		
The manuscript describes a discrete-time reduction of equilibrium prop (EP) which enables the authors to compare the algorithm's gradient estimates and performance directly to BPTT. Moreover, the associated reduction in compute cost also enables them to train the first CNN (that I know of) using EP. While EP approximates BP in feedforward networks, it uses the neuronal activity of an RNN in equilibrium to propagate target information or error feedback to perform credit assignment. While this work may be less interesting for DL practitioners because it is still more costly than backprop (BP), it is one of the contenders for bio-plausible backprop discussed in the literature. In that regard the present work contributes meaningfully to this discussion. The paper is relatively well written, but still contains the occasional typo, and the language could generally be improved. Overall, however, the messages are clear and well presented. I only have minor points which could be used to further improve the manuscript.

Use of RMSE: I am wondering whether it would be better to use cos alpha (the cosine of the angle) between the EP and BP gradients to illustrate performance on all the plots. RMSE is also sensitive to the length of the vector, which can always be absorbed in the learning rate. Since the authors seem to be mostly interested in the direction, a measure that is agnostic to the length would seem more suitable.

I found the formulation of the loss as a sum of losses a bit weird (l. 86). It would make more sense to me to put the temporal index on the loss function and not on the parameters (since they are supposed to be the same, if I understood the authors correctly).

Section 3.2: When comparing EP to BPTT, wouldn't it make more sense to compare to both BPTT & RTRL and find an umbrella term for both? RTRL will give the same gradient as BPTT, but somehow the manuscript makes the argument that EP is forward in time whereas BP isn't. But then RTRL is forward in time too, so this argument needs to be honed a bit. In practice, however, it is clear that gradients are computed using BPTT; I am only suggesting to amend the text here.

Figure 2 & Figure 4: I suggest changing the order of the plotted curves such that the dashed lines are visible. That, or play with transparency. Currently, due to the overlap, the dashed lines are virtually invisible, which is confusing.

Finally, the reduction in computational cost, which is one of the selling points of the discrete-time version, is not shown. It would be nice to add a small graph or table with some numbers in terms of wall-clock time.

l. 41: "real-time leaky integrate neuronal dynamics": something is wrong with this sentence.
l. 58: "similar performance than" -> "to"

Update: Thanks for the clarifications on the RMSE and for adding the wall-clock time. Finally, the updated figures will further improve the MS. Thanks!
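
The RMSE point raised in this review can be illustrated with a small sketch. It is not taken from the manuscript: the vectors `g_bptt` and `g_ep` are hypothetical stand-ins for a BPTT gradient and an EP update, constructed here so that they share a direction but differ in length by a factor of 10. Under that assumption, RMSE reports a large mismatch while the cosine reports perfect alignment.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two vectors."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def cosine(a, b):
    """Cosine of the angle between two vectors (length-agnostic)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
g_bptt = rng.standard_normal(1000)  # hypothetical stand-in for a BPTT gradient
g_ep = 0.1 * g_bptt                 # hypothetical EP update: same direction, 10x shorter

rmse_value = rmse(g_ep, g_bptt)     # large, purely because of the length mismatch
cos_value = cosine(g_ep, g_bptt)    # close to 1: the directions coincide

# Rescaling g_ep (equivalently, changing the learning rate) removes the RMSE gap.
rmse_rescaled = rmse(10.0 * g_ep, g_bptt)
```

In this toy setting the length mismatch is the only source of error, so the RMSE gap disappears once the update is rescaled, which is the sense in which the length "can always be absorbed in the learning rate".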

Reviewer 2


		
Even if the practical applicability of this learning algorithm on current hardware is limited, the theoretical approach and its derivation are certainly relevant to the NIPS community. The paper is well structured, and the mathematical notation is clear and easy to follow. Nevertheless, I have some (minor) concerns.

I miss a clear presentation of the restrictions on the transition function F and of the role of the convergence "assumption" of the first phase. As far as I understood, convergence of the first phase requires F's such that F = dPhi/ds. The prototypical setting seems to state the same in other words, doesn't it? A clear relation between the fixed-point search and energy maximization might be obvious in the whole context of EP, but it is not clear enough from this paper.

A discretized version of EP has to be compared to standard RNN approaches; hence the relation to other standard RNNs that do not converge to a fixed point should also be discussed. In particular, a comparison with LSTM would be nice, together with a comment on the relation to the vanishing/exploding gradient problem and why this is not a problem in view of the fixed-point search.

UPDATE: Thanks for the additional information in the rebuttal and the clarifications. Congrats on this excellent paper!
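
To make the relation between F = dPhi/ds and the fixed-point search concrete, here is a minimal sketch under assumptions that are not from the paper: a toy quadratic primitive function Phi(s) = 0.5 s^T W s + x^T s with a symmetric matrix `W` whose spectral norm is set to 0.5, and a static input `x`. With these choices, dPhi/ds = W s + x, so the dynamics s_{t+1} = F(s_t) form a contraction and converge to the fixed point s* = (I - W)^{-1} x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Assumed toy setup: symmetric weights, rescaled to spectral norm 0.5.
A = rng.standard_normal((n, n))
W = 0.5 * (A + A.T)
W *= 0.5 / np.linalg.norm(W, 2)
x = rng.standard_normal(n)          # static input, fixed over time

def F(s):
    """Transition function F = dPhi/ds for Phi(s) = 0.5 s^T W s + x^T s."""
    return W @ s + x

# First phase: iterate the dynamics from s_0 = 0 until they settle.
s = np.zeros(n)
for _ in range(100):
    s = F(s)

# Closed-form fixed point of s = W s + x, for comparison.
s_star = np.linalg.solve(np.eye(n) - W, x)
residual = float(np.linalg.norm(s - s_star))
```

The symmetry of `W` is what makes F the gradient of a scalar function here. The contraction property, which guarantees convergence, is a separate assumption of this sketch and is not implied by symmetry alone.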

Reviewer 3


		
The authors provide a formulation of equilibrium propagation that is applicable to discrete-time models that usually use backpropagation through time. The main theoretical result of the paper is that the updates from EP are equal, on a step-by-step level, to the updates of BPTT when the transition function of the RNN derives from the "primitive function" (which is similar to the energy function used in Scellier and Bengio 2019) and the RNN converges to a steady state over time. The authors further demonstrate this in practice, also for standard RNNs where this condition is not met.

Originality: The results of this work seem to be original.

Quality: The quality of the paper is high. The main theoretical results are explained in a clear and intuitive way, and the experiments are well motivated.

Clarity: The clarity of the paper is high. The setup, assumptions and theoretical results are clearly explained. The experiments are also well explained. The figures could benefit from having titles/legends or text in the caption clarifying each setup. For example, in fig. 3, it is not clear what each column is unless I read the main text. Another minor fix for clarity would be to explicitly specify what the weight updates are for each of the experiments, and when the weights are updated. The theory section could benefit from a brief verbal outline of the proof.

Significance: The results of the paper are significant, since they elucidate how and when the equilibrium prop algorithm would work in the discrete-time case and for standard RNN models. The fact that the performance comes so close to BPTT on MNIST contributes to the significance. Some factors that could affect the general applicability of this result are the use of symmetric weights and the requirement that the network converge quickly to a steady state.

Further comments:
- It would be interesting to see the behaviour of the algorithm when the weights are not symmetric. Is it expected to work? What would be the trajectory of the weights?
- Quantitative data on how the performance degrades with more layers could help define the limits of the algorithm, in terms of the deepest networks it could train.