NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1398 Initialization of ReLUs for Dynamical Isometry

### Reviewer 1

The authors analyze the propagation of low-order moments in artificial neural networks, with a particular focus on networks with ReLU activation functions. The difference between the proposed approach and those used in previous works is that the authors work in a non-asymptotic setting, i.e., they do not require the width of each layer to go to infinity. Instead, the authors show that for Gaussian weight ensembles, the distribution of the pre-activations conditioned on the preceding layer's post-activation is Gaussian. They then recursively compute the distribution of 2-norms of outputs and the expected covariances, both conditioned on inputs. The authors analyze the change of measure between two layers using an integral operator and study its spectrum. They use these results to propose an initialization scheme for ReLU networks which allows them to train very deep networks without skip connections and batch normalization.

The results in this manuscript are interesting; however, their relation to preceding work, both cited and uncited, ought to be better explained.

Firstly, regarding the novelty of the main contribution: previous work by Pennington et al. has devised isometric and nearly isometric initializations for ReLU networks by shifting the bias appropriately. Another approach, identical to the one taken in this work, emerged from the Shattered Gradients paper [Balduzzi]. While the considerations there were not explicitly motivated by mean-field theory, they do consider two-point correlations just like the current work, which casts doubt on the novelty of this work.

Secondly, the relation between the integral operator and the Jacobian matrix used in the previous mean-field papers is not made explicit. Both the authors of the work under review and the authors of the mean-field papers use a weak-derivative operator, but they either focus on the change of measure or treat the operators as random matrices.
It should therefore be stressed that the approach proposed in this work is another interpretation of the same method applied to the same problem. This in no way detracts from the value of the paper, and would only strengthen its connections to existing literature.

Finally, a few small issues make it harder to understand the authors' points:

* Theorem 1 is presented in a very dense fashion.
* Independence and distribution claims are sometimes ambiguous (conditional independence and distribution over random weight distributions?).
* The N-fold convolution $p^{*N_{l-1}}_{\phi(h_y)}$ and the L-fold application of $T_l$ should be clearly explained.
* The generalized inverse $\phi(\cdot)^{-1}$ is not defined.
* Figure 2 is not clearly labeled. I would suggest changing the colors and/or the labeling.
* Figures 3 and 4 would benefit from having the legend outside the panels, to make it clearer that the labels apply to both.
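
To make the conditional-Gaussianity claim from the summary concrete, the following is a minimal numerical sketch, not the authors' code. It assumes i.i.d. weights $W_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_{\text{in}})$ with $\sigma_w^2 = 2$ (a He-style convention chosen here for illustration; the paper's scaling may differ) and no bias. The function names are hypothetical. For a fixed post-activation vector $x$, each pre-activation should then have variance $\sigma_w^2 \|x\|^2 / N_{\text{in}}$ at any finite width, and for equal layer widths the expected squared norm after the ReLU should equal $\|x\|^2$.

```python
import numpy as np


def conditional_preactivation_variance(x, sigma_w2=2.0):
    """Predicted variance of one pre-activation given the fixed input x.

    Assumes W_ij ~ N(0, sigma_w2 / n_in) and no bias (illustrative convention).
    """
    return sigma_w2 * np.dot(x, x) / x.size


def sample_preactivations(x, n_out, n_samples, sigma_w2=2.0, seed=0):
    """Draw n_samples independent weight matrices and return h = W x for each."""
    rng = np.random.default_rng(seed)
    n_in = x.size
    weights = rng.normal(
        0.0, np.sqrt(sigma_w2 / n_in), size=(n_samples, n_out, n_in)
    )
    return weights @ x  # shape (n_samples, n_out)


# A fixed post-activation vector from a narrow (width 16) previous layer.
x = np.maximum(np.random.default_rng(1).normal(size=16), 0.0)

h = sample_preactivations(x, n_out=16, n_samples=20000)

empirical_var = h.var()
predicted_var = conditional_preactivation_variance(x)

# Expected squared 2-norm after the ReLU, averaged over weight draws.
post_sq_norm = (np.maximum(h, 0.0) ** 2).sum(axis=1).mean()
input_sq_norm = np.dot(x, x)

print(f"pre-activation variance: empirical {empirical_var:.4f}, "
      f"predicted {predicted_var:.4f}")
print(f"squared norm: after ReLU {post_sq_norm:.4f}, input {input_sq_norm:.4f}")
```

The check only covers a single layer and the first two moments; it says nothing about the spectrum of the integral operator or about the initialization scheme the authors propose.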