Paper ID: 324

Title: Deep Learning without Poor Local Minima

This paper proves several important properties of the standard loss function L (as a function of the connection weights) of essentially any linear or rectifier deep neural network with training data set (X, Y) such that XX^T and XY^T have full rank: (i) L is non-convex and non-concave; (ii) every local minimum is a global minimum; (iii) every critical point that is not a global minimum is a saddle point (i.e., there are no local maxima); (iv) under some reasonable assumption, the Hessian at any saddle point has a strictly negative eigenvalue. Item (i) excludes the simple cases, while items (ii)-(iii) show that there are no poor local minima and no local maxima. Item (iv) implies that for a network with just one hidden layer the Hessian at any saddle point always has a strictly negative eigenvalue, while for deeper networks there are saddle points with no strictly negative eigenvalue. It is shown that these "bad saddle points" can be eliminated by perturbation. While these results are simple to state, the proofs provided are quite long and dense: they involve several lemmas and consider different cases of the rank of the product of the weight matrices. As far as I could see the proofs are correct, though some clarification is required (see below).
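To see items (i) and (iv) in the smallest possible instance, here is a minimal numerical sketch (not from the paper; the scalar dimensions and data are assumptions): a depth-2 linear network with scalar weights a, b and a single training pair, where the origin is a critical point whose Hessian has one strictly negative eigenvalue, so L is a non-convex, non-concave function with a strict saddle.

```python
import numpy as np

# Minimal "deep linear network": scalar weights a, b, one sample (x, y).
# L(a, b) = 0.5 * (a*b*x - y)^2 is the squared loss specialized to
# 1-dimensional layers (all dimensions here are assumptions).
x, y = 1.0, 1.0

def loss(a, b):
    return 0.5 * (a * b * x - y) ** 2

# Finite-difference Hessian of L at the critical point (0, 0).
h = 1e-4
cross = (loss(h, h) - loss(h, -h) - loss(-h, h) + loss(-h, -h)) / (4 * h**2)
H = np.array([
    [(loss(h, 0) - 2 * loss(0, 0) + loss(-h, 0)) / h**2, cross],
    [cross, (loss(0, h) - 2 * loss(0, 0) + loss(0, -h)) / h**2],
])
eig = np.linalg.eigvalsh(H)
print(eig)  # approximately [-1, 1]: a strict saddle, hence L is non-convex
```

The analytic Hessian at the origin is [[0, -xy], [-xy, 0]], with eigenvalues ±xy, so the finite-difference result matches the claim that this saddle has a strictly negative eigenvalue.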

The results in the paper are theoretical, yet quite significant. They surely expand our knowledge about deep learning, even though they may not have any immediate applications, as the paper itself acknowledges. Here are my comments and questions about the paper: (i) Line 97: I suggest you include a summary of the reasons provided in previous work (by Baldi and Hornik, 1989) why the assumptions underlying your results (that XX^T and XY^T have full rank) are realistic. This would definitely strengthen the paper and make it more attractive to the reader. (ii) Line 139: It is not clear why \Psi_j, the total number of paths from the inputs to the j-th output, depends on j. Given the linear structure I would have thought that \Psi_j is independent of j. (iii) Line 146: I think the product at the end of this line should go from k=1 to k=H+1. (iv) Line 466: I think the proof in footnote 6 should be provided in the body of the document rather than in the footnote. (v) Lines 468-472: This is where I think the proof becomes unclear. In line 185, {\cal R}(M) is defined as the range space of the matrix M. So what do you mean by {\cal R}(\overline{\cal L}) and by {\cal R}(f), given that \overline{\cal L} and f are not matrices? In addition, since r' = (W'X - Y)^T and r = (W_{H+1} \cdots W_1 X - Y)^T with W' = W_{H+1} \cdots W_1, it follows that r = r'. So what do you mean by {\cal R}(r) \subseteq {\cal R}(r')? Response to authors' feedback: Your remark on this item is unsatisfactory and does not clarify the issue. You say "{\cal R}(f) means the range of a map f (as a function of W)". This does not make any sense. The range of a map is a SET and cannot be considered as a function. In particular, the range of f is simply a subset and not a function of W. As sets, {\cal R}(\overline{\cal L}) and {\cal R}(f) are identical, and {\cal R}(r) and {\cal R}(r') are the same sets. (vi) Line 563: "W" --> "We"
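The identity r = r' that item (v) relies on is easy to check numerically. Here is a small sketch (all dimensions and the random data are assumptions, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed dimensions: 3 inputs, 2 outputs, one hidden layer of width 4, 5 samples.
X = rng.standard_normal((3, 5))
Y = rng.standard_normal((2, 5))
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

Wp = W2 @ W1                 # W' = W_{H+1} ... W_1
r  = (W2 @ W1 @ X - Y).T     # r  = (W_{H+1} ... W_1 X - Y)^T
rp = (Wp @ X - Y).T          # r' = (W' X - Y)^T
print(np.allclose(r, rp))    # True: r and r' are the same matrix
```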

2-Confident (read it all; understood it all reasonably well)

The paper proves two theorems: an unconditional result about the loss landscape of deep linear networks (a chain of matrices) and an assumption-based result for deep ReLU networks. Both answer questions raised in previous work: the first strengthens a conjecture and the second uses fewer assumptions than a previous result. The results provide excellent intuition about the robustness of neural network optimization. Although the proofs are fairly technical calculations, the theorems themselves are extremely intuitive. The paper is likely to benefit researchers trying to understand why simple algorithms (SGD) suffice for technically NP-hard optimization problems, and may lead to further strengthenings or similar results.

The paper makes a significant dent in the central mystery of deep learning: why are neural networks so easy to optimize? The precision of the linear result is quite impressive! The nonlinear result still makes an implausible assumption, but is stronger than previous work and the implausibility is carefully discussed. The paper itself was a delight to read, though I will admit that the full proofs were a bit of a slog. My main question is whether the proofs could be simplified. Two approaches come to mind: use random perturbation tricks to avoid some of the length devoted to singularity, and construct a monotonic descent path from any point to the global minimum. Here's a sketch of how to construct a descent path. First, I claim that the global minimum is

  min_{rank W = p} 1/2 || Y - W X ||^2

This is clearly at least as good as the global minimum, and can be achieved by decomposing the above optimal W as

  W = U D V' = Ue 1 ... 1 De 1 ... 1 Ve' = W_{H+1} ... W_1

where U D V' is the SVD of W, Ue/De/Ve are zero-extended versions of U/D/V, and the 1's are rectangular extended identity matrices. Now consider an arbitrary point W = W_{H+1} ... W_1, and decompose each factor via SVD. Given a matrix decomposition D U V (resp. U V D) for diagonal D and orthogonal U/V, there is a smooth path to a decomposition of the form D 1 V (resp. U 1 D) that does not change the value of the product. Similarly, D U E for diagonal D/E can be transformed into the form 1 U E without changing the product. Using a series of these transforms chained together (which will be continuous if not smooth), we can find a path ending in the form

  Ue 1 ... 1 De 1 ... 1 Ve'

A final path changing Ue/De/Ve' can be constructed to reach the global minimum, along which the loss decreases monotonically. Note that the path can also be made to increase after staying constant. If correct, this is a much simpler proof of 2.3(i) and (ii).
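The zero-extension step in this sketch can be verified numerically. Below is a minimal check (the dimensions, hidden width, and random target W are assumptions) that the zero-extended SVD factors do multiply back to W, so the rank-constrained optimum is attainable by a product of layer matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, h = 3, 2, 4          # assumed input dim, output dim, hidden width

# Target product W; its rank is at most p = min(d_x, d_y) = 2 here.
W = rng.standard_normal((d_y, d_x))
U, s, Vt = np.linalg.svd(W)    # U: 2x2, s: 2 singular values, Vt: 3x3

# Zero-extended factors, as in the sketch (a depth-2 network W3 @ W2 @ W1):
W1 = np.vstack([Vt, np.zeros((h - d_x, d_x))])           # "Ve'" : h x d_x
W2 = np.diag(np.concatenate([s, np.zeros(h - len(s))]))  # "De"  : h x h
W3 = np.hstack([U, np.zeros((d_y, h - d_y))])            # "Ue"  : d_y x h

print(np.allclose(W3 @ W2 @ W1, W))   # True: the factorization recovers W
```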
The above shows that every point W can be monotonically decreased to a global minimum. However, I believe the above proof can be extended to show that every non-global critical point is a saddle point, essentially by rearranging the path construction to change singular values immediately. This would prove 2.3(iii). I am not sure if there is a similarly easy proof of 2.3(iv). Additional comments: 1. Does Theorem 2.3 need a size > 0 condition to ensure (i)? 2. "scaler" should be "scalar" in the proof of Lemma 4.6. 3. It's good that A.7 avoids discussing the other direction, since it is false. For example, if $A = 0 \otimes 0$ then all matrices are $A^-$, but only some matrices are Kronecker products.

2-Confident (read it all; understood it all reasonably well)

This paper settles a couple of conjectures from the machine learning community about neural networks. The authors prove in the linear model case (i.e., Y = W_{H+1} W_H … W_1 X) that every local minimum of the least squares regression problem actually attains the global minimum value. However, they show that for deep networks (with more than 3 layers) there can be "bad" saddle points where the Hessian might not have a negative eigenvalue. The authors show that small perturbations make these bad saddle points go away (but I was not able to understand this part very well). They also extend the results to handle nonlinear networks with ReLU nonlinearities (which are much closer to networks used in practice). This result extends a recent set of results from Choromanska et al. to work with weaker assumptions, but still relies on a few somewhat questionable assumptions.

This was a long and highly technical paper, and I will admit that unfortunately I was not able to give the proofs the time they deserve (so my confidence rating is low). My sense is that the contribution is quite important, since a theoretical understanding of "why neural networks work" still does not exist. There are several critical questions relating to overfitting and optimization. This paper addresses the optimization problem: why can we usually find good local optima for neural net optimization problems? Dauphin et al. conjectured that (1) most local optima are "good" and (2) almost all critical points are saddle points. This paper addresses the first conjecture (and to a lesser extent the second) by showing that all local optima are global optima. More concretely, the authors really nail the linear neural network case, improving a recent result by Choromanska et al. showing the same thing for nonlinear neural networks. In the nonlinear case, there are still several assumptions that must be discarded before this problem can be considered "solved", such as assumption "A5u", which says that whether a "path" is active or not is independent of the inputs to the network. Even so, the result is a big step forward and relies on much more elementary results than the Choromanska paper, which is appealing.

1-Less confident (might not have understood significant parts)

Some major claims are made: (i) if the output dimensionality is less than the input dimensionality, all local minima of the loss surface of a deep linear network (with any depth and width values) are globally optimal; and (ii) under a set of assumptions from Choromanska et al. (2015), an analogous result also holds for nonlinear deep models.

- This is potentially an important paper, if all the proofs of the assertions are verified to be correct.
- Without any conditions on p, Theorem 2.3 (iv) is vacuous, since the assertion implies that the claim holds for any p.
- The presentation needs to be significantly improved. The authors should make some effort to make the paper self-contained.
- The proof sketch for Theorem 3.2 is missing and the proof is entirely delegated to the supplementary material, making verification in a conference submission somewhat hard. Insights are not provided into what assumptions make the loss surface well behaved in terms of not having poor local minima.
- Overall, pretty hard to read, but this could be a promising line of work.

1-Less confident (might not have understood significant parts)

This paper characterizes the critical points of linear and non-linear neural networks. For linear neural networks, it shows that there are no poor local minima, and that for shallow networks all saddle points are strict saddles, while for deep networks there can be bad saddle points. For non-linear neural networks, with two additional assumptions, the authors show that the problem can be reduced to the linear case, so the same claims hold. This can be seen as one possible explanation for the success of training deep neural networks: all local minima are global minima, and most saddle points have negative eigenvalues along which to escape. (Although assumption A5u-m, which says the activation of any path is independent of the input data, is not realistic.)

In terms of technical quality: The whole proof is very intricate, and I didn't follow A.6 and B.1, but the other parts seem correct to me. The lemmas provided in Section 4 are intuitive and helpful; they characterize the first- and second-order information at the critical points. The proof sketches provided in the main paper seem correct to me (except the one for Theorem 2.3(ii), which I didn't follow). Since I'm not sure about the correctness of the whole proof, but the parts that I checked seem correct, I give 3 points for technical quality. In terms of novelty: The paper proves a conjecture and (partially) solves an open problem. Both are important problems about training deep neural networks, and the paper gives general descriptions of the loss surface of linear networks, as well as non-linear networks (under additional assumptions). The loss surface of linear networks, especially the properties of the saddle points and local minima, was not known previously. So I think the novelty is good. In terms of potential impact: If the proof and the claims are correct, I think this paper could be potentially very important. Although the results on non-linear neural networks are based on some unrealistic assumptions, these are fewer than in prior work. Moreover, the results on linear neural networks do not rely on any assumptions. Thus, the paper could help people understand the loss surface of neural networks, and why simple first-order optimization methods like SGD work for training them. In terms of clarity: I feel the paper is well written. It would be better if the authors could write down all the discarded assumptions instead of just referring to their names (A1p, A2p, A3p, etc.). The proof sketch for Theorem 2.3(ii) is still too abstract; a little more intuition would be helpful.
MISC:
- Line 59: "which with" -> "with which"
- Line 114: "corolalry 2.4" -> "corollary 2.4"
- Line 604: "theorem 2.3 (i)" -> "theorem 2.3 (iii)"
- Line 563: "proof w" -> "proof we"

1-Less confident (might not have understood significant parts)

The main contribution of this paper is to prove that for deep "linear" networks, under some assumptions on the training data and for the squared error loss function, every local minimum is a global minimum. The authors further show that picking a subset of the assumptions in (Choromanska et al. 2015b) reduces the model to a linear network.

Understanding optimization in deep learning is very challenging, and any progress in this area, even under several assumptions, is valuable. This paper addresses the optimization problem under 3 assumptions: 1- there is no activation function (deep linear networks); 2- an assumption on the training data (XX^T and XY^T are full rank); 3- the loss function is the squared error loss. This is a significant contribution, and almost all of the technical body of the paper consists of the proofs leading to Theorem 2.3, which uses the above assumptions. However, my main problem with the paper is with the claim about deep NON-LINEAR networks. I find it a bit misleading to claim something about non-linear networks but pick a set of assumptions that makes the network linear! How can someone call a network non-linear if the setting is such that the output of the network is linear in the input and the parameters? In particular, based on the assumptions in the non-linear setting, the loss of the network is EXACTLY EQUAL to the loss of the linear network, and one can only understand this by looking at the appendix; the discussion in the main text is rather vague. Also, the authors claim that their statement is tighter than what is suggested in (Choromanska et al. 2015b). However, as far as I understand, assumptions 2 and 3 in this work are different from the ones in (Choromanska et al. 2015b). I would appreciate some explanation of that. I do agree that the assumptions seem more realistic, but the problem is with making general claims when such conclusions cannot easily be drawn. Suggestions: I think the paper will be much better if it is clear from the beginning that the contribution is on deep linear networks under the above assumptions. The abstract is currently very misleading, because the paper is not proving anything about a genuinely non-linear setting.
I suggest removing the section on deep non-linear networks and instead briefly discussing that picking a subset of assumptions from (Choromanska et al. 2015b) reduces their setting to the linear networks analyzed in the paper. If you insist on having a separate theorem for deep non-linear networks, I think you should bring the discussion of the theorem from the appendix into the main text and make it clear that the network is in fact linear after adopting those assumptions. Finally, I think Section 4 needs more text explaining the intuition behind the lemmas; otherwise it is not very helpful to have them in the main text.
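The sense in which the model "becomes linear" can be illustrated with a small sketch (not from the paper; the layer sizes, the activation probability rho, and the random weights are all assumptions): if each unit's on/off state is a Bernoulli variable independent of the input, as in the assumption under discussion, then the expected network output is exactly a rescaled deep linear network.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
rho = 0.7                            # assumed activation probability
W1 = rng.standard_normal((4, 3))     # assumed layer sizes: 3 -> 4 -> 4 -> 2
W2 = rng.standard_normal((4, 4))
W3 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

# Exact expectation over all Bernoulli(rho) on/off masks of the hidden units,
# with the masks drawn independently of the input x.
expected = np.zeros(2)
for z1 in product([0.0, 1.0], repeat=4):
    for z2 in product([0.0, 1.0], repeat=4):
        prob = (rho ** sum(z1) * (1 - rho) ** (4 - sum(z1))
                * rho ** sum(z2) * (1 - rho) ** (4 - sum(z2)))
        expected += prob * (W3 @ (np.array(z2) * (W2 @ (np.array(z1) * (W1 @ x)))))

linear = rho ** 2 * (W3 @ W2 @ W1 @ x)   # rescaled deep *linear* network
print(np.allclose(expected, linear))     # True
```

Since E[diag(z)] = rho * I for each layer and the masks are independent, the expectation factorizes into rho^2 times the product of the weight matrices, which is the linear model the review says the assumptions reduce to.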

3-Expert (read the paper in detail, know the area, quite certain of my opinion)