NeurIPS 2020

Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

Review 1

Summary and Contributions: The authors study phase retrieval with Gaussian data with gradient flow, in an asymptotic regime of infinite input size, number of samples and training time. The authors present evidence that gradient flow can get stuck in so called threshold states which are spurious local minimizers if not enough samples are used. At a certain critical value of the sample complexity relative to the dimension alpha, a phase transition in the structure of the Hessian at these points appears to occur and a direction of negative curvature emerges. Using tools from statistical physics of disordered systems and under some additional assumptions about the structure of the threshold states, their distribution and the critical value of alpha are computed. The authors also present empirical evidence illustrating the phenomena discussed. The critical value of alpha computed seems to differ from the values observed in experiment, possibly because of finite system size effects. Also, the values of alpha at which recovery is possible with reasonable probability appears much lower than the computed value. I thank the authors for the explanations in the rebuttal and the additional figures provided. I still believe the paper should be accepted.

Strengths: Problems in machine learning that involve solving a nonconvex optimization problem with gradient descent are ubiquitous and still somewhat poorly understood theoretically, and understanding this phenomenon is an important goal for the field. One way to better understand it is the detailed study of nonconvex optimization problems that can be solved efficiently with gradient descent, and this work studied one such problem. The empirical evidence of the phenomena described seem compelling. There also appears to be good agreement between the leading moments of the distribution of the threshold states as evaluated empirically and the computed moments.

Weaknesses: One of the main assumptions used in characterizing the "threshold states" at which the gradient flow dynamics appear to get trapped is that the Hessian is positive semidefinite. On the other hand, in figure 2 as the training loss crosses the threshold energy the minimal eigenvalues of the Hessian appear clearly negative, unless I am misunderstanding the figure. The authors do not appear to address this point. Could the soundness of this assumption also account for the inaccuracy of the computed value of alpha at which the phase transition occurs which can be seen in figure 4? The main result of the paper - the computation of the relative sample size alpha at which the phase transition occurs, doesn't seem to be very accurate when compared to experiments in figure 3 and 5. It would have also been helpful to plot this value in the figure in order to make this point clear. The discrepancy could be a result of finite size effects as the authors claim, but could also be a result of say the assumption made about the Hessian at the threshold states or the accuracy of the 1RSB ansatz. The fact that recovery is possible for much lower values of alpha than those predicted by the calculation, as seen in figure 5, raises a suspicion that there may be another mechanism in play, at least in the setting in which the experiment was conducted. Insufficient details are provided about the experiments performed. In particular it is unclear how the learning rate chosen affects the dynamics.

Correctness: The methodology is a widely accepted one in statistical physics, including a number of assumptions made and steps in the derivation that are not rigorous yet yield accurate predictions in many known cases.

Clarity: The paper is well written overall. A number of results from previous works are quoted and used. The clarity could be improved if these are introduced in greater detail. While this is impractical to do in the main text due to space constraints, a more detailed overview of these in the supplementary materials would have been helpful. The critical value of alpha from the calculation could have been added to figure 3 for clarity.

Relation to Prior Work: Yes, the details of prior work as well as key distinctions are discussed.

Reproducibility: Yes

Additional Feedback: Is there any sense of how the learning rate used affect the time it takes to escape the threshold states?

Review 2

Summary and Contributions: The authors in this work study the behavior of gradient flow in the phase retrieval problem. They extend known results that show why GF can converge even in the presence of spurious local minima by investigating the behavior of the “threshold states”, i.e. states to which GF is attracted and undergo a phase transition. They empirically show that the geometry of these states changes from minima to saddles as we increase the number of samples, allowing the algorithm to escape towards the direction of the true global minimum. The authors investigate when this transition happens and they provide empirical indications for a poly-log gap between the number of samples that are known to be needed for GF to converge to the global minimum and the ones that are actually needed.

Strengths: The paper makes interesting contributions towards understanding non-convex optimization and the behavior of our optimization algorithms. There has been an increasing number of works in the last few years that help us understand the optimization landscape of various machine learning problem and this work nicely fits on that literature.

Weaknesses: The main weakness of the paper is that it provides only empirical indications for some of the claims instead of actual proofs. However, I still find the experiments well-executed and the results interesting. The authors also mention that the formula for the number of samples where the transition happens holds if we know the distribution P(\hat{y}, y). Is this a reasonable assumption? It would be good for the authors to add some clarifications regarding that. In general, I would say that the paper lacks a few explanations and intuition regarding the technical part, it is fairly hard to follow.

Correctness: The goals of the paper are well defined and the experiments are elaborate and support them well. The authors convincingly demonstrate the convergence to threshold states before transitioning towards the minimum as well as the evolution of the eigenvalues of the Hessian to characterize the phase transition. It is difficult to assess the correctness of the theoretical part though.

Clarity: Overall, the paper is well-written and well-motivated. As I mentioned before the technical part lacks some explanations and intuition to understand what’s happening. For example, in line 113: it would be useful to include the description of the theorem and some intuition about what these quantities mean and how they emerge. Some minor comments: - the phase retrieval section in page 2 ends kind of abruptly. You could mention for example why we need the Hessian, or find any way to make the transition more smooth. - in the phase retrieval section it would be better to define the number of samples as m instead of \alpha N in my opinion, since you are talking about the problem generally and you don’t have to be constrained in linear samples. You can refer to m = \alpha N, in the related work section as you do. - in line 61 you write “with perfect recovery achievable with the approximate message passing algorithm only for α > αalg = 1.13”. Do you mean that above that threshold the recovery happens in poly-time or that it is information theoretically possible? It is not clear from the context. - line 46: typo: gradient-descent flow instead of gradient flow - in Section 3: you use \mathbb{P} for the distribution while in the rest of the paper you only use $P$.

Relation to Prior Work: The paper mentions many relevant related works and has clearly stated contributions to show the extensions upon what is known. A relevant citation that the authors might want to include is “A convergence analysis of gradient descent for deep linear neural networks” by Arora et al, 2019. The authors in this work so that for linear NNs, the trajectories that gradient flow takes are such that allow convergence to a global optimum despite the fact that there are spurious local minima and saddle points in the landscape.

Reproducibility: Yes

Additional Feedback: A question that I have by looking at the graphs of Figure 1 is: is there some intuitive explanation as to why is the threshold energy higher for more samples?

Review 3

Summary and Contributions: The authors study a non-convex variant of the perceptron teacher-student learning dynamics problem, namely the so called phase retrieval problem.

Strengths: Depending on the ratio of the number of measurements over the input dimension, various dynamical phenomena are discussed in the asymptotic regime. The success in finding the global minimum by escaping critical points through unstable directions is identified with the BBP-type transition known in random matrix theory. The results provide new insights on learning dynamics and on the role of the so called threshold states. The paper is a nice interplay between analytical techniques from physics and numerics. In my opinion these techniques deserve to be better known in the ML community as they are capable of revealing quite sophisticated dynamical phenomena.

Weaknesses: The results are limited to the perceptron architecture. But going beyond this is know to be quite challenging. In the statistical physics community, the results would be considered as an application of spin glass theory. From a qualitative perspective, the results do not display any real new phenomena compared to previous studies of teacher-student learning.

Correctness: yes.

Clarity: yes.

Relation to Prior Work: yes.

Reproducibility: Yes

Additional Feedback: