NeurIPS 2019
Reviewer 1
Originality: I am not convinced that the contributions of this paper are more significant than those of [1], which has already been cited in this paper. Specifically, in the comparison with [1] in Line 82, the authors state that these conclusions apply to a smaller set in weight space. I would appreciate it if the authors could quantify the difference here and add a discussion section showing this comparison in some mathematical form. Further, there have been quite a few papers that show convergence of GD on neural networks using something like strong convexity, e.g. [3], and I would appreciate a discussion of how this work relates to them.

Clarity: The paper is written quite clearly and it is easy enough to follow. I especially appreciate the proof sketches given below the statements of the main theorems. In certain places in the appendix, however, there is some confusion regarding the scope of the index variables (I have pointed it out in the next section), and certain things need a bit more elaboration, e.g. Line 214 - "The bound in (6) is restrictive". Could you elaborate on the implications of this? A minor comment: it would be better to use the \|W\|_F notation for the Frobenius norm instead of \|W\|, which is commonly used to denote the matrix 2-norm. I am aware that you have clearly defined the notation, though.

Technical Comments
1. Line 88-89: I would request the authors to elaborate on Line 88-89 about "then differentiable local minima should satisfy ... as the network has enough capacity". However, [2] deals with a very different setting, including leaky-ReLUs and, perhaps more importantly, a multiplicative noisy regularizer like dropout and a mild over-parameterization requirement. The math needed to connect this work with [2] does not seem entirely straightforward to me. Moreover, the set U here depends on both the spectral norm and the loss.
2. My first major doubt is in the derivation of (7). If I understand correctly, you impose the constraint $\|W\|_* \le R$ in (6) to obtain (7). However, $\|W\|_* \le R$ implies $1/\|W\|_* \ge 1/R$, and this does not imply (7). In that case, I think that, contrary to the claim in Line 176, $U(\lambda, \theta) \cap B(R)$ does not include all points with training error below that certain bound.
3. Could you please clarify what you mean by "bounded region" in Line 196 - ".. in bounded region, as prescribed by the set U(\lambda, \theta)"? Which bounded region are you referring to?
4. In Line 198, it is stated that the size of U(\lambda,\theta) depends on the training data. Could you clarify whether you are referring to whether a small enough loss can be obtained by a W that is small enough in spectral norm? It would be very useful to somehow quantify, or give an idea of, how large the set U(\lambda, \theta) actually is.
5. Lemma 5, 6: Lemmas 5 and 6 say that every critical point is an isolated local minimum. However, they do not guarantee the existence of such a local minimum, do they? In particular, because the loss is regularized it is not trivial to expect zero loss, and thus I do not see how you can guarantee that U(\lambda,\theta) is not simply a trivial set containing no minimizers.
6. Could you please elaborate on Line 214 - "The bound in (6) is restrictive"? Do you mean that the set defined by U in (6) is practically uninformative?
7. Further, the experiments in (4) are quite far from providing evidence for the theory:
1. They do not address strong convexity of the landscape, but only convexity when restricted to the gradient path. These are quite different (see the toy illustration after this list).
2. They work with the exponential loss, which does not have a finite minimizer, instead of quadratic losses, which do. In some sense, the exponential loss is not expected to have any minima in any bounded set of parameters, so it can never satisfy the hypothesis of a minimum in a bounded set (again, see the illustration after this list).
8. In Supplementary Lines 11-15, the indices i, j are quite confusing. They seem to be referring to different things. Please make this clearer.
9. I don't understand how you derived (3) from the previous line in the supplementary. Could you please elaborate on why it is okay to ignore the second term?
10. Line 56 in Supplementary: The definition of U(\lambda, \theta) also admits a strict inequality, so 15 should also have a strict inequality.
11. In the proof of Lemma 5, in Line 69, you start by assuming that W is a critical point. This does not guarantee the existence of one. Could you please clarify whether this is indeed the case?
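To make points 7.1 and 7.2 above concrete, here are two toy illustrations of my own (they are not taken from the paper; the specific functions and the homogeneity assumption are mine).

For 7.1, consider
\[
f(x, y) = x^2 - y^2,
\]
which has a saddle at the origin. Gradient descent started at $(x_0, 0)$ stays on the line $y = 0$, and along that path the restricted function $g(x) = f(x, 0) = x^2$ is $2$-strongly convex, even though $f$ itself is not convex in any neighbourhood of the origin. Strong convexity along the gradient path is therefore a much weaker property than (piecewise) strong convexity of the landscape.

For 7.2, take a bias-free ReLU network $f_W$ with $L$ layers, so that $f_{cW}(x) = c^L f_W(x)$ for $c > 0$, and suppose some $W_0$ separates the training data, i.e. $y_i f_{W_0}(x_i) > 0$ for all $i$. Then the (unregularized) exponential loss satisfies
\[
L(c W_0) = \frac{1}{n} \sum_{i=1}^{n} \exp\bigl(-c^{L} y_i f_{W_0}(x_i)\bigr) \to 0 \quad \text{as } c \to \infty,
\]
while $L(W) > 0$ for every finite $W$; the infimum is not attained, so no bounded set of parameters contains a global minimizer.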
In summary, I find that the claims of the paper do not quite match its results, in that they overstate the results for the reasons pointed out above. For this reason I am rejecting the paper, but I will be willing to change my score if you can argue against my points above.

[1] Safran, Itay, and Ohad Shamir. "On the quality of the initial basin in overspecified neural networks." International Conference on Machine Learning. 2016.
[2] Soudry, Daniel, and Yair Carmon. "No bad local minima: Data independent training error guarantees for multilayer neural networks." arXiv preprint arXiv:1605.08361 (2016).
[3] Du, Simon S., et al. "Gradient descent finds global minima of deep neural networks." arXiv preprint arXiv:1811.03804 (2018).

----------------------------------------------------------

Thank you for answering my questions in the rebuttal and for pointing out my mistake in understanding the derivation of (7). Since it was my misunderstanding, and not a problem with the paper, that had initially led me to give lower scores, and after discussion with the other reviewers, I am now updating my score to reflect this. I hope you will incorporate the rest of the changes and corrections we discussed. I do like the contributions of the paper, and I hope the authors will continue working in this field and potentially identify cases where the theory merges better with the empirical findings, which in my opinion is the strongest drawback of the paper. Beyond that, I would like the authors to discuss the applicability of Lemma 4: whether its prerequisites ever kick in, and to identify problems where $\lambda$ actually needs to be large so that the prerequisites of the lemma apply.
Reviewer 2
This paper is well-written and makes novel contributions to the analysis of the energy landscape of neural networks (although, admittedly, I am not an expert on this topic). The authors' results contribute to the understanding of the optimization problems to be solved during training and have the potential to foster new convergence results for gradient descent on such problems. The experimental section is, however, weaker than the theoretical one. It is a little unfortunate that the necessary bound (6) was too difficult to satisfy in practice even on small toy problems, so that the numerical experiments cannot confirm the theory. This clearly limits the significance of the findings for practical applications. Nevertheless, the experiments could lead to the conjecture that similar properties hold for different problems (classification with cross-entropy, softmax, batchnorm, etc.). The criterion investigated in the experiments, namely convexity along a trajectory, is weaker than convexity of the (high-dimensional) energy landscape, and works that have computed the Hessian after the convergence of gradient descent often still report a few negative eigenvalues (e.g. https://arxiv.org/abs/1611.07476). Thus, I am wondering to what extent the theory and the numerical experiments actually demonstrate the same effect. In this respect it seems unnatural to even deviate from the setting of a regression problem to a classification problem. In any case, I think it would be interesting to consider a network with few enough parameters to actually compute the Hessian and its eigenvalues (see the sketch below). Despite the weak link between the experiments and the theory, I consider the theory to clearly be the main contribution, which is why I see this submission on the positive side.
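To make the last suggestion concrete, here is a minimal sketch of what I have in mind (my own code, not the authors'; it assumes PyTorch, and the toy architecture, data, regularization strength, and training schedule are arbitrary placeholders): train a small network with plain gradient descent on a regularized quadratic loss, then compute the exact Hessian of the training objective at the reached point and inspect its smallest eigenvalues.

```python
# Sketch only: a network small enough that the full Hessian of the training
# objective can be computed exactly and its spectrum inspected.
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
X = torch.randn(64, 4)                          # toy regression data (placeholder)
y = torch.sin(X.sum(dim=1, keepdim=True))

# Tiny ReLU network 4 -> 8 -> 1: only 49 parameters, so a 49 x 49 Hessian.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
lam = 1e-3                                      # placeholder regularization strength

def flat_params(m):
    """Concatenate all parameters into one flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in m.parameters()])

def loss_from_flat(theta):
    """Regularized quadratic training loss as a function of the flat parameter vector."""
    shapes = [(8, 4), (8,), (1, 8), (1,)]       # same order as model.parameters()
    params, idx = [], 0
    for s in shapes:
        n = 1
        for d in s:
            n *= d
        params.append(theta[idx:idx + n].reshape(s))
        idx += n
    h = torch.relu(X @ params[0].T + params[1])
    out = h @ params[2].T + params[3]
    return torch.mean((out - y) ** 2) + lam * theta.pow(2).sum()

# Plain gradient descent on the same objective.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.mean((model(X) - y) ** 2) + lam * sum(p.pow(2).sum() for p in model.parameters())
    loss.backward()
    opt.step()

theta_star = flat_params(model)
H = hessian(loss_from_flat, theta_star)         # exact Hessian at the reached point
eigs = torch.linalg.eigvalsh(H)                 # eigenvalues in ascending order
print("five smallest Hessian eigenvalues:", eigs[:5].tolist())
# Negative values here would indicate the reached point is not a local minimum of
# the regularized objective, which is exactly the question raised above.
```

If the smallest eigenvalues are all positive, that would directly support local (strong) convexity around the reached point, which is a stronger check than convexity along the optimization trajectory alone.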
Reviewer 3
I kind of like this paper. This is the first theoretical work, at least to my knowledge, that does not rely on any unrealistic assumption or toy architecture of the deep network. The model studied here is both deep and nonlinear, which is what the community is actually using. If there is no other unrealistic assumption hiding somewhere I have not noticed, I would give strong credit. Piecewise strong convexity is a nice and reasonable notion for characterizing the network, and something that aligns with my personal conjecture. Too bad I am not the one to propose and prove it. The paper is written well, and the math seems to be solid and clean. I wonder whether the theory could be used to explain some existing empirical findings, e.g. what the relation of the learning rate is to the number of layers, and to the number of nodes per layer, without the overparameterization assumption. This would make a huge difference if the theory could be applied to some nontrivial findings.