Review for NeurIPS paper: Stochastic Normalization

NeurIPS 2020

Stochastic Normalization

Meta Review

Although the reviewers remark that the paper offers no insights or analysis, it was on average well received as an empirical architecture design idea that addresses an important problem. The experimental validation is conducted to the standards of the field and shows that the method is empirically useful. The combination BSS+StochNorm is particularly promising. The authors are invited to submit the final version, considering the following improvements:

- The paper can be condensed to avoid self-repetition and redundancy (in the definitions of the normalizations, in the descriptions of the contribution, which appear something like thrice, in the existing methods, and in the algorithm, its description and Fig. 1).
- The freed space and the 9th page could be used to clarify important details of the experimental setup that are needed to understand the basis of comparison: how the hyperparameters are chosen per method, and whether the 5 trials include a random train-validation split. Please also include the additional results from the rebuttal and discuss in more depth the points below relating to the literature.

As pointed out by the reviewers, using moving averages is fundamentally different from using batch statistics, in that the moving average is treated as a constant for back-propagation. The issues related to this should be highlighted and discussed more in the paper. In particular, the claim that "BN reduces covariate shift" was, I believe, an imprecise statement in the original paper by Ioffe, and it is misinterpreted by many subsequent works in the field. It should mean that BN allows for a quick adjustment when a covariate shift happens. The empirical design idea of randomly mixing parameter-sensitive batch statistics with parameter-insensitive moving averages brings something new that can motivate further research.

- The following works should be discussed (more), as they are directly relevant:
  - Ioffe, S. (2017), "Batch renormalization: Towards reducing minibatch dependence in batch-normalized models", because it combines batch statistics and running averages (deterministically, with a schedule).
  - Guo et al. (2018), "Double Forward Propagation for Memorized Batch Normalization", a different variant related to the above.
  - Shekhovtsov and Flach (2018), "Stochastic Normalizations as Bayesian Learning", which combines full-dataset statistics approximated analytically and introduces randomness to replicate the regularizing effect of BN.
  - Li et al. (2019), "Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift", which can be used to support arguments about dropout, as it directly studies the (in)compatibility of BN and dropout.
  - If the discussion of Disout is relevant, please note that the journal paper on dropout also shows that a "Gaussian dropout" with noise N(1, sigma^2) works equally well, and has thus already shown a method that "perturbs neuron outputs rather than drops them".
- A discussion of, and possibly some comparison with, semi-supervised learning methods (which the experimental setup with 15% of the data clearly allows) would be important in order to see the larger picture.

If I may, also a private request: the colorful hyperlinks (citations, sections, equations) are quite distracting. Could you keep the hyperlinks but make their text black?
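
The distinction raised above, that batch statistics depend on the inputs while moving averages act as constants for back-propagation, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy batch, the selection probability `p`, and the choice to select one branch per call (rather than, say, per channel) are all assumptions made for illustration. The sensitivity of one output to one input is estimated by finite differences in plain Python.

```python
import random

EPS = 1e-5  # small constant for numerical stability (assumed value)


def batch_stats(xs):
    """Mean and biased variance of a list of scalars."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


def normalize(xs, mean, var):
    return [(x - mean) / (var + EPS) ** 0.5 for x in xs]


def norm_with_batch_stats(xs):
    # The statistics are functions of xs, so a change in any input
    # also changes the mean and variance used for normalization.
    mean, var = batch_stats(xs)
    return normalize(xs, mean, var)


def norm_with_moving_stats(xs, moving_mean, moving_var):
    # The statistics are fixed numbers: constants for back-propagation.
    return normalize(xs, moving_mean, moving_var)


def d_out0_d_in0(fn, xs, h=1e-6):
    """Central finite difference of output[0] with respect to input[0]."""
    up, down = list(xs), list(xs)
    up[0] += h
    down[0] -= h
    return (fn(up)[0] - fn(down)[0]) / (2 * h)


def stochastic_mix(xs, moving_mean, moving_var, p, rng):
    """Hypothetical mixing rule: with probability p use the moving-average
    branch, otherwise the batch-statistics branch."""
    if rng.random() < p:
        return norm_with_moving_stats(xs, moving_mean, moving_var)
    return norm_with_batch_stats(xs)


batch = [0.5, -1.0, 2.0, 1.5]
mean, var = batch_stats(batch)

# Freeze the moving statistics at the current batch values, so that both
# branches produce identical outputs and differ only in their derivatives.
grad_batch = d_out0_d_in0(norm_with_batch_stats, batch)
grad_moving = d_out0_d_in0(
    lambda xs: norm_with_moving_stats(xs, mean, var), batch
)

print("d out[0] / d in[0], batch statistics: ", grad_batch)
print("d out[0] / d in[0], moving statistics:", grad_moving)
```

Even though both branches give the same forward output here, the derivative through the batch-statistics branch is smaller, because perturbing an input also shifts the mean and variance it is normalized by. With frozen statistics the derivative is simply the inverse standard deviation.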