NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 95 Understanding Weight Normalized Deep Neural Networks with Rectified Linear Units

### Reviewer 2

This paper presents a general framework for norm-based capacity control of L_{p,q} weight-normalized, fully connected deep networks with ReLU activations, including bias terms in all layers, by establishing an upper bound on the Rademacher complexity. For p >= 1, the authors must work with the Rademacher average, which makes the upper bound depend on the average width of the network; this dependence appears unavoidable. Next, with an L_{1,q} normalization, the authors analyze the regression setting and provide both approximation and generalization error bounds. They argue that under L_{1,\infty} weight normalization, both the generalization and approximation errors are controlled by the L_1 norm of the output layer; in that case the bound is independent of the width and depth of the network, whereas for p > 1 the bound depends on the average width.

The authors analyze binary classification in detail, but the multi-class case, and how the bounds change there, is dispatched in a single line (198); I would like to see more explanation. I find the analysis interesting and the proofs accurate and well written.

A downside of the paper is that the authors do not propose a practical algorithm for imposing the weight normalization, in particular the L_{1,\infty} normalization. It would be of interest to see how large c_0 is in practice; can the authors argue about this theoretically? The next important step would be to show that the method works in practice (on a couple of datasets such as CIFAR-100). Nevertheless, I think the paper is worth accepting, as it provides a first step toward better regularization for deep networks.

Minor point: in some places in the paper (lines 13, 18, 42, 129), the fully connected DNNs are called feed-forward DNNs. While this usage was acceptable in earlier days, it is better avoided now, since convnets also feed the input forward, and the term will cause confusion. Please use "fully connected" throughout the paper.
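Since the review raises the absence of a practical algorithm for L_{1,\infty} normalization, here is a minimal sketch of one natural approach: rescaling each row of a weight matrix so that its L_1 norm is at most a constant c, which caps the L_{1,\infty} norm (the maximum row-wise L_1 norm) at c. The function name, the choice of c, and the row-wise convention are all assumptions for illustration; the paper under review specifies no such procedure.

```python
import numpy as np

def project_l1_inf(W, c=1.0):
    """Rescale each row of W so its L1 norm is at most c.

    Since the L_{1,inf} norm of W is the maximum over rows of the
    row-wise L1 norms, this caps ||W||_{1,inf} at c. Rows already
    inside the L1 ball of radius c are left untouched.
    NOTE: illustrative sketch only; not the paper's algorithm.
    """
    row_l1 = np.abs(W).sum(axis=1, keepdims=True)
    # Shrink only rows whose L1 norm exceeds c (epsilon avoids 0/0).
    scale = np.minimum(1.0, c / np.maximum(row_l1, 1e-12))
    return W * scale
```

One would apply such a projection after each optimizer step; how this interacts with training dynamics is exactly the kind of empirical question the review asks the authors to address.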

### Reviewer 3

This paper examines the Rademacher complexity of L_{p,q}-normalized ReLU neural networks. The results imply that both generalization and approximation error can be controlled through normalization. These results are interesting and represent a meaningful contribution to the understanding of ReLU networks.

Some improvements do need to be made to the writing. In particular, the results are simply listed in the main text, with all accompanying motivation and intuition deferred to the proofs in the supplement. Since the paper appears to have ample white space between lines, there should be plenty of room to provide reasonably detailed intuition and/or proof sketches for the main results in the main text. This addition would greatly improve the readability of the paper for a larger audience.

Edit: The rebuttal addressed all of my concerns, and the new proof the authors indicate will appear in the final version is interesting. I therefore vote for accept.
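For reference, the L_{p,q} norm both reviews refer to is, in the convention common in this literature (whether the inner index runs over incoming or outgoing weights depends on the paper's own convention, which I have not reproduced here):

```latex
\|W\|_{p,q} \;=\; \Bigl( \sum_{j} \Bigl( \sum_{i} |W_{ij}|^{p} \Bigr)^{q/p} \Bigr)^{1/q}
```

The special case L_{1,\infty} discussed by reviewer 2 corresponds to taking the maximum over j of the L_1 norms of the groups, i.e. \max_j \sum_i |W_{ij}|.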