NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal

### Reviewer 2

This paper studies the optimization of a one-hidden-layer neural network for solving multi-class classification problems with the softmax loss. It shows that with large enough over-parametrization (depending polynomially on dimension factors and 1/\eps), mini-batch SGD converges to a classifier with at least 1-\eps classification accuracy. I feel this is a strong paper.

The part that surprised me the most is Lemma 5.2, which says that with enough over-parametrization, whenever the "pseudo" classification error is still large, the gradient norm of the pseudo-net is also large, so SGD keeps making progress. This is a joint statement about approximation theory (there has to be a point with low classification error) and the optimization landscape (if you are not at such a point, your gradient has to be large). Such a joint statement seems too strong to hold generically, but here it probably comes from the linearity of the pseudo-net. Combining this with Lemma 5.1, which couples the original net and the pseudo-net for a bounded number of iterations (T <= O(1/\eta)), one gets that SGD on the original net has to make progress unless the classification error is low, which is the main result. This analysis pattern is quite novel, and I have not seen similar analyses in the literature.

The discussion section is very well written and insightful as well. In particular, quite inspiringly, it postulates that with over-parametrized models, there can be a solution not far from the initialization with high probability. This phenomenon is also presented in the experiments. I wonder whether the relative Frobenius error scales as 1/sqrt{K} -- it looks like this is the case for any fixed t. As a potential exploration, it might be good to choose a different t for each K based on some pre-determined criterion (e.g. classification accuracy > 90%) and look at the scale of the relative error, to see whether this phenomenon holds more convincingly.
One limitation is that the setting might make the classification problem itself easy (each class is a mixture of L well-separated components). But given the novelty of the result and the analysis, I think the paper brings in interesting new ideas and directions for future work, which makes it a nice contribution.
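The exploration suggested above could be sketched roughly as follows. This is a minimal NumPy sketch, not the paper's actual setup: the architecture f(x) = A relu(Wx) with a fixed top layer, the initialization scaling, and the stopping criterion are my own illustrative choices. It stops training once accuracy passes the pre-determined threshold and reports ||W_t - W_0||_F / ||W_0||_F for several widths K:

```python
import numpy as np

def train_relu_net(X, y, K, lr=0.3, steps=3000, acc_target=0.9, seed=0):
    """SGD on the softmax loss for f(x) = A @ relu(W @ x), with the top
    layer A held fixed. Stops once training accuracy exceeds acc_target
    and returns the relative Frobenius distance ||W_t - W_0|| / ||W_0||."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    C = int(y.max()) + 1
    W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(K, d))
    A = rng.choice([-1.0, 1.0], size=(C, K)) / np.sqrt(K)
    W = W0.copy()
    for step in range(steps):
        i = rng.integers(n)                       # mini-batch of size 1
        h = np.maximum(W @ X[i], 0.0)
        logits = A @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()
        p[y[i]] -= 1.0                            # d(loss)/d(logits)
        grad_W = (A.T @ p)[:, None] * (W @ X[i] > 0)[:, None] * X[i][None, :]
        W -= lr * grad_W
        if step % 50 == 0 and step > 0:           # check the stopping criterion
            preds = (np.maximum(X @ W.T, 0.0) @ A.T).argmax(1)
            if np.mean(preds == y) >= acc_target:
                break
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

# Two well-separated Gaussian blobs, one per class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.0, 2.0], 0.3, (100, 2)),
               rng.normal([-2.0, -2.0], 0.3, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
rel = {K: train_relu_net(X, y, K) for K in (16, 256, 1024)}
```

If the 1/sqrt{K} picture is right, rel[K] should shrink roughly like 1/sqrt{K} as the width grows, even with t chosen per K by the accuracy criterion.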

### Reviewer 3

This paper analyzes the generalization performance of two-layer ReLU networks in an overparametrized regime. Specifically, the networks are trained with SGD on separable multiclass problems. In such a setting, the authors prove that the learned network has small generalization error with high probability. This result generalizes the result in [7], which only considers linearly separable data. The paper is well organized and easy to follow; the proof sketch and the discussion of insights make it easy to understand the technique in use and the implications of this work.

There is no doubt that the authors address an important problem: explaining the generalization performance of deep learning is a core task of modern machine learning. But the contribution of this work seems incremental. The theoretical result generalizes the result in [7] by adopting a more general separable dataset. It is true that such structured (multiclass) data is closer to the practical scenario; however, truly practical scenarios normally involve non-separable datasets, which cannot be easily analyzed with the approach used in the paper (the separability parameter \delta > 0 seems necessary even for the simplified case in Section 5).

Moreover, mixing generalization and optimization in the main theorem likely complicates the analysis and restricts the scope of the result. In particular, SGD might have nothing special with regard to learning a solution with good generalization performance; it just happens to be the only practical method for neural network training. For random forests [a] and kernel machines [b], solutions that are not learned by SGD (e.g. the exact solution for kernel regression) can achieve both (almost) zero training error and excellent generalization performance. Note that all of these models/architectures are overparametrized.
So we see empirically that for several overparametrized architectures, methods other than SGD can learn a solution with small generalization error. Thus, it is hard to tell which inductive bias is more important for obtaining good generalization performance: the one derived from the optimization method or the one enforced by the architecture selection.

Some minor things:
- Line 123: "or every \epsilon" ---> "for every \epsilon".
- Line 141: "by SGD" ---> "by GD".
- Line 276: "converges to 0" ---> "converges to 100%".

[a] Wyner, Abraham J., et al. "Explaining the success of adaboost and random forests as interpolating classifiers."
[b] Belkin, Mikhail, et al. "To understand deep learning we need to understand kernel learning."

===== After Rebuttal =====

After reading the rebuttal and all the reviews, I feel that this paper deserves a higher score.
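The kernel-machine point above, that an exact non-SGD solution can interpolate the training data, can be seen in a minimal ridgeless kernel regression sketch (my own illustration, not taken from [b]):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Random training set with +/-1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = rng.choice([-1.0, 1.0], size=40)

# Exact (ridgeless) kernel regression: solve K alpha = y in closed form.
# No SGD is involved, yet the predictor fits every training label exactly.
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K, y)
train_preds = K @ alpha
```

Training error is exactly zero by construction; whether such interpolating solutions also generalize well is the empirical point made in [b].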

### Reviewer 4

Summary: This paper analyzes SGD on the cross-entropy objective for a two-layer neural network with ReLU activations. The authors assume a separability condition on the input data.

Pros:
1. I believe this paper analyzes a very interesting version of the problem: it sticks to the cross-entropy loss and studies SGD, the algorithm used most often in practice.
2. The techniques seem novel and innovative, and different from past work. I think this is an important contribution and adds strength to this work.

Questions:
1. Assumption A1 seems very similar to linear separability. I do not understand why one cannot club D_{i,j} for all j \in [l] into a single distribution D_i; then assumption A1 would be the same as linear separability. Am I missing something? It would be helpful to explain assumption A1 pictorially.
2. The rates in Theorem 4.1 should be made explicit. Currently the result is extremely opaque, which makes it hard to interpret and to verify its correctness.
3. The idea of a pseudo loss and coupling it to the true loss is interesting. I think it would improve the presentation of the paper to add a bit more intuition for the technique on page 5.
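On question 1, one possible picture (my own guess at why clubbing the components might lose something, not taken from the paper): if each class is a mixture of several well-separated components, the components can be pairwise far apart while the classes as wholes are not linearly separable, as in an XOR-style layout. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # samples per class

def sample_mixture(centers):
    """Sample a class that is a mixture of well-separated components."""
    cs = np.array(centers, dtype=float)[rng.integers(len(centers), size=n)]
    return cs + rng.normal(scale=0.3, size=(n, 2))

# Class 0: components near (3,3) and (-3,-3); class 1: near (3,-3) and (-3,3).
X0 = sample_mixture([(3, 3), (-3, -3)])
X1 = sample_mixture([(3, -3), (-3, 3)])
X = np.vstack([X0, X1])
y = np.array([1.0] * n + [-1.0] * n)

# Every class-0 component stays far from every class-1 component, yet no
# single hyperplane separates the two classes: a least-squares linear
# classifier is near chance on this data.
cross_gap = np.sqrt(((X0[:, None, :] - X1[None, :, :]) ** 2).sum(-1)).min()
Xb = np.hstack([X, np.ones((2 * n, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
linear_acc = np.mean(np.sign(Xb @ w) == y)
```

Clubbing D_{i,j} into a single D_i here would give a class that is not linearly separable at all, so the per-component separation in A1 seems genuinely weaker than linear separability of the classes; a figure along these lines in the paper would settle the question.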