Review for NeurIPS paper: Information theoretic limits of learning a sparse rule

NeurIPS 2020

Information theoretic limits of learning a sparse rule

Review 1

Summary and Contributions: This paper extends recent work on the "all-or-nothing" phenomenon observed in binary compressive sensing (i.e., a sparse linear model) to a class of generalized linear models. Roughly speaking, the all-or-nothing phenomenon demonstrates that the mean-square error undergoes a rapid phase transition from its maximum value to zero after crossing a sampling threshold.

Strengths: Extending the all-or-nothing analysis to a broad family of generalized linear models is a challenging problem. This paper makes progress towards this goal by 1) characterizing the mutual information of the generalized model via a "single-letter" variational formula 2) applying this bound to develop a non-rigorous analysis of the all-or-nothing phenomenon for generalized linear models via the replica method 3) demonstrating agreement between these non-rigorous predictions and the performance of the GAMP algorithm. These are all novel contributions and important steps towards a full understanding of the generalized linear models case. (The authors now claim a rigorous bound.)

Weaknesses: Edit after author feedback: I believe that the work is much stronger now that the authors have established rigorous MMSE bounds. I have increased my overall score accordingly. The main weakness of the paper is that the connection to the all-or-nothing phenomenon is non-rigorous, whereas prior work for linear models establishes rigorous bounds. Given this limitation, the paper would have been stronger had it increased the focus on the agreement between the non-rigorous prediction and state-of-the-art algorithms. While the mutual information bound is interesting on its own, it would be more interesting with an explicit connection to the all-or-nothing phenomenon.

Correctness: The claims seems correct.

Clarity: The paper is relatively well-written, although some sections, especially the introduction, could be streamlined to make the key points more clearly. I would have preferred if the paper were more explicit and direct in the introduction about the fact that the all-or-nothing analysis is non-rigorous. (This suggestion is now moot given that the authors have established a rigorous connection.)

Relation to Prior Work: The paper makes extensive references to prior work, but I think it would be presented more clearly if it placed itself in the category of statistical physics bounds on phase transitions in sparse models, rather than in the category of the all-or-nothing phenomenon, which is established rigorously. (Again, one can ignore this comment given the new rigorous bounds.)

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper studies the fundamental limits of a sparse estimation problem that can be viewed as estimation in a generalized linear estimation model or learning the parameters a prediction rule for a one layer neural network (i.e., the perceptron). Previous work has rigorously characterized interesting phase transitions for quality of recovery (with optimal methods) in settings where "sparsity" means that a fixed *fraction* of components is nonzero. The contribution of the present paper is to extend these results to sub-linear sparsity where the fraction of nonzeros converges to zero as the problem dimensions diverge. In this regime, very recent work focusing on the *linear* model has demonstrate an "all-or-nothing" phenomenon where the MSE converges to a step function as a function the sample complexity. The results of the paper depend on some nontrivial technical advances. Most significant is a refinement of the adaptive interpolation method that allows for different problem dimensions to diverge at different rates. As a theory paper, this is a nice and timely contribution to active research area showcasing the value of exact asymptotic theory. From the perspective of machine learning, this paper would benefit by provided some more discussion about how the results relate more usual learning problem.

Strengths: The paper builds upon sophisticated and very recent tools to provide interesting and exact analysis of a (stylized but still relevant) inference problem. This a solid contribution for the fundamental problem of learning the parameters of the perception, which should be of interest to the theoretical side of ML

Weaknesses: There is a lot of notation set up needed to understand and interpret the results. Some simple examples upfront may help reach a broader audience. It would be nice if analysis of all-or-nothing behavior from the potential (Section 3 int he Supplement) were rigorous instead of heuristic.

Correctness: The proof it quite long but I am fairly confident that the results are sound.

Clarity: It is written well enough given the space constraints. (More examples and discussion would definitely help though.)

Relation to Prior Work: I think the authors do a decent job in this regard.

Reproducibility: Yes

Additional Feedback: I suggest the authors look at the at Section 3.1 as well as the proof in the online version of citation [33], which deals with the linear model. I suspect the same ideas could be applied with minor modification to establish the all-or-nothing behavior from the potential function rigorously. In any event, the illustrations of the normalized potential (Figure 2 in [33]) may also be helpful. EDIT: I have read the authors response and appreciate that they have made the result rigorous. My score remains the same.

Review 3

Summary and Contributions: The paper considers the generalized linear model, in the high-dimensional sparse setting. i.e. a sublinear number of features are important (or even informative) for predicting the outcomes or observations. In this setting the authors prove the existence of an 'all-or-nothing phenomenon' (see GZ, RXZ) wherein there exists a function m_*(n) (sublinear in the ambient dimension n) such that - if the # samples m > (1+eps) m_*, the error in estimating the signal vanishes; and - if m < (1-eps) m_*, the error converges to the 'trivial' error of estimating from the prior. Classical statistical or learning theory typically demonstrates a 'graceful' decay of error as the number of samples grows large. The current work is in a line of work demonstrating a 'sharp cutoff' phenomenon instead. This was initiated by GZ, RXZ in linear regression setting. This is closely related to notions like channel capacity in information theory, but is different in the sublinear scaling, which can cause technical complications. It is analogous to the 'cutoff phenomenon' in mixing of markov chains (see reference D) The authors carry this out using an extension of statistical physics techniques to the 'sublinear' regime. The steps are 1. To compare (or interpolate) the mutual information in the current problem with that of simpler, 'scalar' observation problem with a calibrated signal-noise-ratio. Via an adaptive interpolation method (BM), the authors then prove upper and lower bounds that (asymptotically) coincide for an appropriately calibrated scalar problem 2. Demonstrate that the mutual information goes from 'trivial' to 'full' as m crosses m_* 3. Link the mutual information with the estimation and generalization errors using the I-MMSE identity. Reference key GZ: Gamarnik , Zadik 'High dimensional regression with ...' RXZ: Reeves, Xu, Zadik 'The all-or-nothing ...' DAM: Deshpande, Abbe, Montanari, 'Asymptotic mutual information ...' D: Diaconis 'The cutoff phenomenon in finite Markov Chains' BM: Barbier Macris 'The adaptive interpolation ....'

Strengths: The papers main strengths are: 1. The problem considered is fundamental, and the phenomenon demonstrated is interesting in its own right, as well for algorithm design questions in statistical estimation and learning. The distinct nature of the phenomenon to relatively 'standard' arguments in statistics and learning is of significant interest to the NeurIPS community, and I would welcome more of it. 2. In terms of techniques, the proof methods have been used signficantly in the past few years (see BM and later papers) but the application to the sublinear data regime is also nice. I think the paper would make a valuable addition to the NeurIPS program.

Weaknesses: I don't think the paper has signficant weaknesses. The model setting is admittedly quite limited, but given the relatively sparse literature on the cutoff phenomenon in learning, I do not think this is a strong complaint. Some suggestions to improve: - I would recommend the authors to distinguish the all-or-nothing or cutoff phenomenon from usual statistical bounds that the machine learning and NeurIPS community is familiar with. - Mention that 'teacher student' setting is another phrasing of 'well-specified' models in statistics, 'realizable setting' in learning theory, and 'proper learning' in computer science. - Regularizing noise: if this is needed for the proof, preferably write it as such. Presumably a limiting argument works for Delta vanishing, e.g. one can always add small noise to the observvations. - It would also be nice to have a rigorous proof for the MMSE portion. Since this is not done in the current paper, but potentially accessible to present techniques, the authors should identify Eq (10) as a 'Claim'.

Correctness: The paper's results seem correct to me, while I did not verify the proofs in full detail. The authors also provide code to reproduce the synthetic experiments.

Clarity: The paper is well written and easy to understand for the general reader. Some small suggestions regarding writing are in the Other comments portion of the review.

Relation to Prior Work: The related work is correctly referenced and introduced. The i-mmse connection was first introduced in DAM, so I recommend to include the reference.

Reproducibility: Yes

Additional Feedback: - L.62 'more precisions' is weird phrasing - L. 69 intuitively explain the scaling of alpha_n - L.81 This neural-network interpretation is somewhat forced. It is okay to care about models/algs that are not neural nets. - L.98 n^{-1/9} is odd scaling... is this a proof artifact? If so rephrase the line to indicate this. - L.131 'absent of' is odd - Table 1 suggests for zero noise there is not an 'all or nothing' phenomenon for linear models since gamma_c = 0 - Figure 1: what are the lighter colored lines, presumably potential value at q=1? EDIT POST RESPONSE: After reading the author response, I am maintaining my score on the paper.

Review 4

Summary and Contributions: In this paper, the authors study the estimation error of generalized linear models (GLM) in cases which the amount of samples is sublinear with respect to the total number of features of the signal. They motivate this problem by first providing an estimation interpretation in which the goal is to find the lowest reconstruction error as number of samples grow. They also provide a learning interpretation which consists of a teacher-student scenario in which the student wants to learn the weights of the teacher’s perceptron with general activation function. To this end, the authors provide a variational formula for the asymptotic mutual information per sample as system size goes to infinity. This Theorem is the main contribution of the paper. To highlight the importance of the provided Theorem, they use analytical arguments to show the existence of the all-or-nothing phenomenon for the estimation error of a GLM and the generalization error of a perceptron neural network with a general activation function. To illustrate this phenomenon, the authors plot the obtained MMSE for Bernoulli and Bernoulli-Rademacher priors and various activation functions. It is shown that in all different configurations, the MMSE jumps from a value close to 1 to a value close to 0.

Strengths: The paper is well written and results are non-trivial and interesting. They consider the GLM which can be seen as an extension of others works that consider the linear model. They compare their theoretical result with prior work done on the linear model.

Weaknesses: While the linear model is motivated by compressed sensing, it would be interesting if authors could tie the GLM model with other applications. For example, where is the teacher-student network interpretation used in practical applications? Figures 2 and 3 do not have sufficient captions. What is the small plot inside the big one? Why do some of the dotted lines divide into two lines? The authors addressed this in the rebuttal.

Correctness: I did not check the proofs. The empirical methodology looks correct.

Clarity: The paper is well written. It is sometimes hard to follow the math, but I think the authors have made an effort to make it flow and move proofs to the end of the paper or supplementary material.

Relation to Prior Work: The authors mention prior work and highlight distinctions with their work throughout the paper.

Reproducibility: Yes

Additional Feedback: