NeurIPS 2020

Gaussian Gated Linear Networks


Review 1

Summary and Contributions: The authors propose an extension of the GLN by modelling each neuron as a weighted product of Gaussians. Doing so enables them to model regression tasks as well as density estimation tasks. Their G-GLN model is back-propagation free and updates directly using online gradient descent. Hence, the G-GLN is much more efficient in terms of both computation and data requirements. Additionally, thanks to their model and loss function choices, their optimization problem is locally convex.
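To make the core mechanism concrete for readers less familiar with GLNs, the following is a minimal numpy sketch of the closure property the paper exploits: a weighted product (geometric mixture) of Gaussian densities is again proportional to a Gaussian, whose precision is the weighted sum of the input precisions and whose mean is the precision-weighted mean. The function name and example values here are illustrative assumptions, not the authors' implementation.

import numpy as np

def gaussian_geometric_mixture(mus, sigmas2, weights):
    # Weighted product of 1-D Gaussian densities:
    # prod_i N(x; mu_i, sigma_i^2)^{w_i} is proportional to a Gaussian whose
    # precision is the weighted sum of precisions and whose mean is the
    # precision-weighted mean (assumes positive weights).
    mus, sigmas2, weights = map(np.asarray, (mus, sigmas2, weights))
    precisions = weights / sigmas2              # w_i / sigma_i^2
    out_precision = precisions.sum()
    out_mu = (precisions * mus).sum() / out_precision
    return out_mu, 1.0 / out_precision          # mean and variance of the product

# Example: one neuron combining two Gaussian predictions from the layer below.
mu, var = gaussian_geometric_mixture([0.0, 2.0], [1.0, 4.0], [0.7, 0.3])
print(mu, var)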

Strengths: 1. The GLN-based idea is very interesting and novel. Current SOTA models can achieve amazing results but require much more computational resources due to back-propagation. 2. The paper is very well written and easy to follow. The figures, tables, and algo-box are also easy to understand. 3. The paper provides sufficient background information on both prior work and model properties for the Gaussian distribution. 4. The paper shows competitive performance of the G-GLN model on various tasks such as regression, density estimation, and bandits.

Weaknesses: 1. For Table 2, on the SARCOS dataset, the G-GLN is trained with 1200 epochs. Are all models trained with the same number of epochs? 2. Similarly, for Table 1, are all models trained with 40 epochs? I am not sure whether the results for the G-GLN are the average prediction after 1 epoch or after 40 epochs. More generally, I am curious whether the G-GLN model has an advantage in terms of training epochs.

Correctness: The claims and method seem correct.

Clarity: The paper is very well written.

Relation to Prior Work: The paper provides sufficient background information and comparisons to its predecessor GLN model.

Reproducibility: Yes

Additional Feedback: Post rebuttal: keeping original score.


Review 2

Summary and Contributions: This paper introduces a new backpropagation-free deep learning algorithm for multivariate regression that leverages local convex optimization and data-dependent gating to model highly non-linear and heteroskedastic functions. Experiments are conducted on several univariate and multivariate regression benchmarks in comparison with the state of the art. The proposed method outperforms competitive algorithms.
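To clarify what "data-dependent gating" refers to in the GLN family, the following is a minimal numpy sketch of halfspace gating: the side information selects, via the sign pattern of a few fixed hyperplanes, which weight vector a neuron uses. The specific shapes, hyperplanes, and weight-table layout are assumptions for illustration only, not the authors' exact implementation.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative halfspace gating for a single neuron: k fixed hyperplanes over
# the side information z induce 2^k contexts, each with its own weight vector.
k, d, fan_in = 3, 5, 4
hyperplanes = rng.normal(size=(k, d))             # gating directions (not learned)
biases = rng.normal(size=k)
weight_table = rng.normal(size=(2 ** k, fan_in))  # one weight vector per context

def context_index(z):
    # Map side information z to a context id via the sign pattern of k halfspaces.
    bits = (hyperplanes @ z > biases).astype(int)
    return int(bits @ (2 ** np.arange(k)))

z = rng.normal(size=d)                 # side information (here, the input features)
w = weight_table[context_index(z)]     # weights active for this particular input
print(context_index(z), w.shape)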

Strengths: The paper is well written and clear. The proposed framework, G-GLN, can be considered an extension of the recently proposed GLN family of deep neural networks. In detail, the authors extend the GLN framework from classification to multiple regression and density modelling by generalizing geometric mixing to a product of Gaussian densities. Proofs of the related theorems are given, and the experiments are sufficient.

Weaknesses: 1. Why is “side information” defined as the input features of an example? Moreover, the role of “side information” should be further demonstrated by experiments. 2. The discussion of related work is not sufficient; more work on GLNs should be covered to reflect the advantages and differences of the proposed method, such as the difference from the B-GLN. 3. More detailed experimental analysis should be connected to the objective function.

Correctness: Yes

Clarity: Good

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: The authors propose Gaussian gated linear networks, extending the GLN framework from classification to multiple regression. The proposed approach is competitive on several benchmarks, including contextual bandits and density estimation.

Strengths: The authors build upon the work on Bernoulli GLNs [2] and [3] and the well-known closure property of Gaussian distributions to extend GLNs to (multiple) regression and density estimation tasks, from which the concept of Gaussian GLNs arises naturally. The experiments show that the G-GLN is competitive on several benchmarks covering (multiple) regression, online contextual bandits, and density estimation.

Weaknesses: One could argue that the proposed approach is a straightforward extension of [2] and [3], thus limiting novelty. However, the connection to Gaussian distributions makes it an interesting approach. Some important details of the proposed approach are relegated to the supplementary material in favor of less interesting background material. The authors do not address the interpretability and robustness-to-catastrophic-forgetting aspects of the proposed approach beyond mentioning them in the introduction.

Correctness: The technical details of the approach as well as the experimental setup seem correct and well-described.

Clarity: The paper is clearly written, the paper is well motivated and the assumptions, methodology and experiments are clearly stated.

Relation to Prior Work: The authors highlight the existing work on GLNs for classification and note that no prior work exists on regression tasks with GLNs.

Reproducibility: Yes

Additional Feedback: The authors may consider moving some of the details of base-model learning and switching aggregation to the main paper, since these are less obvious and more important to the proposed approach than the well-known, textbook properties of the Gaussian distribution (Section 2.1).


Review 4

Summary and Contributions: The paper proposes the Gaussian gated linear network (G-GLN), which is an extension of the GLN family of deep neural networks. Properties of the G-GLN are studied, and the authors evaluate it numerically on well-known datasets, comparing it to other methods.

Strengths: The idea of the GLN is a local credit assignment mechanism in which each neuron optimizes its own objective; previously, the Bernoulli GLN was established. In this paper, the authors use the fact that exponential family densities are closed under multiplication to formulate the G-GLN. Simulation studies demonstrate that the G-GLN is competitive.

Weaknesses: This is not my research area, so I was not able to identify weaknesses. In the GLN, side information is fed in to improve learning rather than using all of the information. Could the authors show or demonstrate the improvement due to side information? Could the authors also give rough guidance on an optimal proportion for the side information?

Correctness: The claims don’t seem to have any major fault.

Clarity: It was well structured, and the idea, architecture, and algorithm are well explained.

Relation to Prior Work: The relation to the previous work, the Bernoulli GLN, was stated.

Reproducibility: Yes

Additional Feedback: