NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 324 Deep Learning without Poor Local Minima

### Reviewer 1

#### Summary

This paper proves several important properties of the standard loss function L (as a function of the connection weights) of essentially any linear or rectifier deep neural network with a training data set (X, Y) such that XX^T and XY^T are of full rank: (i) L is non-convex and non-concave; (ii) every local minimum is a global minimum; (iii) every critical point that is not a global minimum is a saddle point (i.e., there are no local maxima); (iv) under some reasonable assumption, the Hessian at any saddle point has a strictly negative eigenvalue. Item (i) excludes the simple cases, while items (ii)-(iii) show that there are no poor local minima and no local maxima. Item (iv) implies that for a network with just one hidden layer the Hessian at any saddle point always has a strictly negative eigenvalue, while for deeper networks there are saddle points with no strictly negative eigenvalue. It is shown that these "bad saddle points" can be eliminated by perturbation. While these results are simple to state, the proofs provided are quite long and dense: they involve several lemmas and consider different cases depending on the rank of the product of the weight matrices. As far as I could see the proofs are correct, though some clarification is required (see below).
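A minimal numerical illustration of items (i) and (iv) in the one-hidden-layer case (a toy example of my own, not taken from the paper): with scalar weights a, b and data x = y = 1, the loss L(a, b) = (1/2)(ab - 1)^2 has a critical point at the origin whose Hessian has a strictly negative eigenvalue, i.e., a strict saddle.

```python
import numpy as np

# Toy depth-2 "linear network" with scalar weights a, b and data x = y = 1,
# so L(a, b) = 0.5 * (a*b - 1)**2.  Illustration only, not from the paper.
def loss(w):
    a, b = w
    return 0.5 * (a * b - 1.0) ** 2

def hessian(f, w, eps=1e-5):
    """Crude finite-difference Hessian of f at w."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei) - f(w + ej) + f(w)) / eps**2
    return H

# The origin is a critical point (zero gradient) but not a global minimum,
# and its Hessian has a strictly negative eigenvalue, as item (iv) asserts.
eigs = np.linalg.eigvalsh(hessian(loss, np.zeros(2)))
print(eigs)  # eigenvalues approximately -1 and 1
```

The exact Hessian at the origin is [[0, -1], [-1, 0]], so the saddle is strict; for deeper chains this can fail, which is exactly the distinction the summary draws.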

#### Qualitative Assessment

The results in the paper are theoretical, yet quite significant. They surely expand our knowledge about deep learning even though they may not have any immediate applications, as the paper itself acknowledges. Here are my comments and questions about the paper:

(i) Line 97: I suggest you include a summary of the reasons given in previous work (Baldi and Hornik, 1989) why the assumptions underlying your results (that XX^T and XY^T are of full rank) are realistic. This would definitely strengthen the paper and make it more attractive to the reader.

(ii) Line 139: It is not clear why \Psi_j, the total number of paths from the inputs to the j-th output, depends on j. Given the linear structure I would have thought that \Psi_j is independent of j.

(iii) Line 146: I think the product at the end of this line should go from k=1 to k=H+1.

(iv) Line 466: I think the proof in footnote 6 should be given in the body of the document rather than in the footnote.

(v) Lines 468-472: This is where I think the proof becomes unclear. In line 185, {\cal R}(M) is defined as the range space of the matrix M. So what do you mean by {\cal R}(\overline{\cal L}) and by {\cal R}(f), given that \overline{\cal L} and f are not matrices? In addition, since r' = (W'X - Y)^T and r = (W_{H+1} ... W_1 X - Y)^T with W' = W_{H+1} ... W_1, it follows that r = r'. So what do you mean by {\cal R}(r) \subseteq {\cal R}(r')?

Response to the authors' feedback: Your remark on this item is unsatisfactory and does not clarify the issue. You say "{\cal R}(f) means the range of a map f (as a function of W)". This does not make sense: the range of a map is a SET and cannot be considered a function. In particular, the range of f is simply a subset and not a function of W. As sets, {\cal R}(\overline{\cal L}) and {\cal R}(f) are identical, and {\cal R}(r) and {\cal R}(r') are the same sets.

(vi) Line 563: "W" --> "We".
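As a sanity check on point (ii), assuming a fully connected layered network (the layer widths below are hypothetical), one can enumerate input-to-output paths by brute force and confirm that the count does not depend on the output index j:

```python
from itertools import product
from math import prod

# Hypothetical widths d_0 (input), d_1, d_2 (hidden), d_3 (output);
# the network is assumed fully connected, as in the linear model.
widths = [3, 5, 4, 2]

def paths_to_output(widths, j):
    """Enumerate input-to-output paths terminating at output unit j."""
    assert 0 <= j < widths[-1]
    # A path selects one unit in each of layers 0..H and then ends at the
    # fixed output unit j, so j cannot affect the count.
    return sum(1 for _ in product(*(range(d) for d in widths[:-1])))

counts = [paths_to_output(widths, j) for j in range(widths[-1])]
print(counts)  # [60, 60]: identical for every output unit j
```

The count equals d_0 * d_1 * ... * d_H for every j, which is what the linear structure suggests.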

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

The paper proves two theorems: one unconditional result about the loss landscape of deep linear networks (a chain of matrices) and one assumption-based result for deep ReLU networks. Both answer questions raised in previous work: the first strengthens a conjecture, and the second uses fewer assumptions than a previous result. The results provide excellent intuition about the robustness of neural network optimization. Although the proofs are fairly technical calculations, the theorems themselves are extremely intuitive. The paper is likely to benefit researchers trying to understand why simple algorithms such as SGD suffice for technically NP-hard optimization problems, and it may lead to further strengthenings or similar results.

#### Qualitative Assessment

The paper makes a significant dent in the central mystery of deep learning: why are neural networks so easy to optimize? The precision of the linear result is quite impressive! The nonlinear result still makes an implausible assumption, but it is stronger than previous work and the implausibility is carefully discussed. The paper itself was a delight to read, though I will admit that the full proofs were a bit of a slog.

My main question is whether the proofs could be simplified. Two approaches come to mind: using random perturbation tricks to avoid some of the length devoted to singularity, and constructing a monotonic descent path from any point to the global minimum.

Here is a sketch of how to construct a descent path. First, I claim that the global minimum is

    min_{rank W = p} 1/2 || Y - W X ||^2

This is clearly at least as good as the global minimum, and it can be achieved by decomposing the optimal W above as

    W = U D V' = Ue 1 ... 1 De 1 ... 1 Ve' = W_{H+1} ... W_1

where U D V' is the SVD of W, Ue/De/Ve are zero-extended versions of U/D/V, and the 1's are rectangular extended identity matrices.

Now consider an arbitrary point W = W_{H+1} ... W_1, and decompose each factor via SVD. Given a matrix decomposition D U V (resp. U V D) for diagonal D and orthogonal U/V, there is a smooth path to a decomposition of the form D 1 V (resp. U 1 D) that does not change the value of the product. Similarly, D U E for diagonal D/E can be transformed into the form 1 U E without changing the product. Chaining a series of these transforms together (which will be continuous if not smooth), we can find a path ending in the form

    Ue 1 ... 1 De 1 ... 1 Ve'

A final path changing Ue/De/Ve' can be constructed to reach the global minimum, along which the loss decreases monotonically. Note that the path can also be made to increase after staying constant. If correct, this is a much simpler proof of 2.3(i) and (ii).
The above shows that every point W can be monotonically decreased to a global minimum. However, I believe the proof can be extended to show that every non-global minimum is a saddle point, essentially by rearranging the path construction to change singular values immediately. This would prove 2.3(iii). I am not sure if there is a similarly easy proof of 2.3(iv).

Additional comments:

1. Does Theorem 2.3 need a size > 0 condition to ensure (i)?
2. "scaler" should be "scalar" in the proof of Lemma 4.6.
3. It's good that A.7 avoids discussing the other direction, since it is false. For example, if $A = 0 \otimes 0$ then every matrix is an $A^-$, but only some matrices are Kronecker products.
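The achievability step in the sketch above (writing a rank-p matrix as a product of factors through a width-p bottleneck, using zero-extended SVD factors and rectangular identities) is easy to check numerically. The sizes, depth, and variable names below are hypothetical:

```python
import numpy as np

# Numerical check of the achievability claim: a rank-p matrix W can be
# written as a product of factors through hidden layers of width >= p,
# using truncated/zero-extended SVD factors ("Ue", "De", "Ve'") and an
# extended identity matrix ("1").
rng = np.random.default_rng(0)
m, n, p, h = 4, 6, 2, 3      # output dim, input dim, rank, hidden width

W = rng.standard_normal((m, p)) @ rng.standard_normal((p, n))  # rank p
U, s, Vt = np.linalg.svd(W)

Ue = U[:, :h]                                       # first h left directions
De = np.zeros((h, h)); De[:p, :p] = np.diag(s[:p])  # zero-extended D
Ve = np.vstack([Vt[:p, :], np.zeros((h - p, n))])   # zero-extended V'
I_rect = np.eye(h)                                  # the "1" factor

# Depth-4 chain W_4 W_3 W_2 W_1 realizing W exactly:
product = Ue @ I_rect @ De @ Ve
print(np.allclose(product, W))  # True
```

Only the first p singular directions carry mass because rank(W) <= p, so the truncated factors lose nothing; this is the same reason the "1"-padded decomposition in the sketch attains the rank-constrained optimum.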

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

This paper settles a couple of conjectures from the machine learning community about neural networks. The authors prove, in the linear model case (e.g., Y = W_H * W_{H-1} ... W_1 * X), that every local minimum of the least squares regression problem actually attains the global minimum value. However, they show that for deep networks (more than 3 layers) there can be "bad" saddle points where the Hessian might not have a negative eigenvalue. The authors show that small perturbations make these bad saddle points go away (but I was not able to understand this part very well). They also extend this to handle nonlinear networks with ReLU nonlinearities (which are much closer to the networks used in practice). This result extends a recent set of results from Choromanska et al. to work with weaker assumptions, but it still relies on a few somewhat questionable assumptions.
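A toy example (my own, not from the paper) of why deeper networks admit such "bad" saddle points: with three scalar weights, the loss L(a, b, c) = (1/2)(abc - 1)^2 has a critical point at the origin where every second derivative vanishes, so the Hessian is the zero matrix and offers no negative eigenvalue to escape along.

```python
import numpy as np

# Toy depth-3 scalar network: L(a, b, c) = 0.5 * (a*b*c - 1)**2.
# The origin is a critical point, yet every second derivative there
# involves a product of the other two (zero) weights, so the Hessian
# vanishes entirely: a "bad" saddle, unlike the depth-2 case.
def loss(w):
    a, b, c = w
    return 0.5 * (a * b * c - 1.0) ** 2

def hessian(f, w, eps=1e-4):
    """Crude finite-difference Hessian of f at w."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei) - f(w + ej) + f(w)) / eps**2
    return H

H0 = hessian(loss, np.zeros(3))
print(np.linalg.eigvalsh(H0))  # [0. 0. 0.]: no strictly negative eigenvalue
```

This matches the paper's claim that second-order information alone cannot certify escape from such points, and that a small perturbation is needed to remove them.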

#### Qualitative Assessment

This was a long and highly technical paper, and I will admit that unfortunately I was not able to give the proofs the time they deserve (hence my low confidence rating). My sense is that the contribution is quite important, since a theoretical understanding of "why neural networks work" still does not exist. There are several critical open questions relating to overfitting and optimization. This paper addresses the optimization question: why can we usually find good local optima for neural net optimization problems? Dauphin et al. conjectured that (1) most local optima are "good" and (2) almost all critical points are saddle points. This paper addresses the first conjecture (and to a lesser extent the second) by showing that all local optima are global optima. More concretely, the authors really nail the linear neural network case, and they improve on a recent result by Choromanska et al. showing the same thing for nonlinear neural networks. In the nonlinear case, there are still several assumptions that must be removed before this problem can be considered "solved", such as assumption "A5u", which says that whether a "path" is active or not is independent of the inputs to the network. Even so, the result is a big step forward and relies on much more elementary results than the Choromanska paper, which is appealing.

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 4

#### Summary

Some major claims are made: (i) If the output dimensionality is less than the input dimensionality, all local minima for the loss surface of a deep linear network (with any depth and width values) will be globally optimal; and (ii) under a set of assumptions from Choromanska et al 2015, an analogous result also holds for nonlinear deep models.

#### Qualitative Assessment

- This is potentially an important paper, if all the proofs of the assertions are verified to be correct.
- Without any conditions on p, Theorem 2.3 (iv) is vacuous, since the assertion implies that the claim holds for any p.
- The presentation needs to be significantly improved. The authors should make some effort to make the paper self-contained.
- The proof sketch for Theorem 3.2 is missing and the proof is entirely delegated to the supplementary material, making verification in a conference submission somewhat hard. No insight is provided into which assumptions make the loss surface well behaved, in the sense of not having poor local minima.
- Overall, the paper is pretty hard to read, but this could be a promising line of work.

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 5

#### Summary

This paper characterizes the critical points of linear and non-linear neural networks. For linear neural networks, it shows that there are no poor local minima; for shallow networks all saddle points are strict saddles, while for deep networks there can be bad saddle points. For non-linear neural networks, under two additional assumptions, the authors show that the problem can be reduced to the linear case, so the same claims hold. This can be seen as one possible explanation for the success of training deep neural networks: all local minima are global minima, and most saddle points have negative eigenvalues along which to escape. (Although assumption A5u-m, which says that the activation of any path is independent of the input data, is not realistic.)

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 6

#### Summary

The main contribution of this paper is to prove that, for deep "linear" networks, under some assumptions on the training data and for the squared error loss function, every local minimum is a global minimum. The authors further show that adopting a subset of the assumptions in (Choromanska et al. 2015b) reduces the model to a linear network.