NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1835 Inherent Weight Normalization in Stochastic Neural Networks

### Reviewer 1

Summary
-------

This work combines stochastic neural networks with binarised neural networks. The authors introduce neural sampling machines (NSMs), which are neural networks with additive and/or multiplicative noise on the pre-activations. The main focus of the paper, however, is on NSMs with multiplicative noise, which exhibit a weight-normalising effect. Gaussian and Bernoulli noise sources are considered, and training is done by backpropagating through the probability function of the binary activations rather than by using some variant of the REINFORCE algorithm.

Comments
--------

*originality*: The contributions of this paper are, to the best of my knowledge, original. The idea of self-normalisation, however, was introduced with Self-Normalizing Neural Networks (SNNs) (Klambauer et al., 2017), which makes the title somewhat confusing, as does the use of SNN as an abbreviation for stochastic neural network.

*quality*: If I understood the purpose of binarised networks correctly, they are mainly interesting for porting networks to low-end hardware. Therefore, I would not expect binarised networks to outperform regular networks. This paper does claim to have binarised networks that are better than their non-binarised counterparts. From the current experiments, it is not clear to me where exactly the improvement comes from. Therefore, I think it would be interesting to additionally compare:

- with standard binarised networks, to assess improvements due to the normalisation effect plus the stochasticity/learning rule;
- with normalised binarised networks, to assess improvements due to the normalisation effect;
- with normalised stochastic neural networks, to assess improvements/deterioration due to binarisation.

Am I correct to say that, with the initialisation for CIFAR-10 (line 231), the NSM is practically a fine-tuned version of the deterministic model that you compare to? If yes, this would be a rather unfair comparison.

Apart from the above and the typos listed below, this is a clean piece of work.

- line 110: Shouldn't the probability of a neuron firing be $P(z_i = 1 \mid z) = 1 - \Phi(u_i \mid z)$?
- lines 136 and 142: The bias term should not appear in these equations for multiplicative noise, see line 118.
- equation 7: The second term contains a derivative w.r.t. $\beta_i$; shouldn't this be w.r.t. $v_{ij}$?

*clarity*: Overall, the paper is well written and easy to read. There are some points, however, where rewriting or adding more details might be useful to improve understanding and reproducibility (this might be partly resolved by a code release):

- line 122: "where $\beta$ is a parameter to be determined later" is rather uninformative. I would opt for something like "where $\beta$ models the effects due to the noise, $\xi$".
- line 129: How important is this term $a_i$? Can't $\beta_i$ be controlled enough by tweaking the variance (with Gaussian noise) or the probability of success (with Bernoulli noise)?
- line 170: Why are the gradients through the probability function a biased estimate of the gradient of the loss?
- line 174: If you rely on automatic differentiation, does this imply that you have to compute both activations and probabilities in the forward pass?
- line 187: What is this hardware implementation? Doesn't it suffice to test at the software level?
- line 191: What is "standard root-mean-square gradient back-propagation" in this context? From the text, it appears that the authors use a softmax layer with cross-entropy loss (it is quite confusing to also mention the negative log-likelihood loss, which is only an implementation detail in the end). So I assume it has nothing to do with the root mean squared error.

It is also not clear how the data was split into training, validation and test sets. The tables mention that the test error was computed over 100 samples, but it is not clear how these samples were chosen.

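To make the line 110 comment and the normalising effect mentioned in the summary concrete, here is a minimal numerical sketch. It is my own simplified reading, not the paper's exact parameterisation: a single threshold neuron with pre-activation $u = \sum_j (a + \xi_j) w_j z_j$, independent Gaussian noise $\xi_j \sim \mathcal{N}(0, \sigma^2)$, no bias term, and firing when $u > 0$. The function names and the parameters `a` and `sigma` are assumptions chosen for illustration.

```python
import math

import numpy as np


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF, written in terms of math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def firing_probability(w, z, a=1.0, sigma=1.0) -> float:
    """Closed-form P(u > 0) under the assumed noise model.

    u is Gaussian with mean a * (w . z) and standard deviation
    sigma * ||w * z||, so P(u > 0) = 1 - Phi(0) = Phi(mean / std).
    Undefined when w * z is all zeros (std would be 0).
    """
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    mean = a * np.dot(w, z)
    std = sigma * np.linalg.norm(w * z)
    return std_normal_cdf(mean / std)


def monte_carlo_firing_rate(w, z, a=1.0, sigma=1.0,
                            n_samples=200_000, seed=0) -> float:
    """Empirical firing rate obtained by sampling the multiplicative noise."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    xi = rng.normal(0.0, sigma, size=(n_samples, w.size))
    u = ((a + xi) * w * z).sum(axis=1)
    return float(np.mean(u > 0))


if __name__ == "__main__":
    w = np.array([0.5, -1.0, 2.0, 0.3])  # hypothetical weights
    z = np.array([1.0, 0.0, 1.0, 1.0])   # hypothetical binary inputs
    print("closed form  :", firing_probability(w, z))
    print("Monte Carlo  :", monte_carlo_firing_rate(w, z))
    print("weights x 10 :", firing_probability(10.0 * w, z))
```

In this simplified setting the firing probability depends on the weights only through the ratio of `w . z` to the norm of `w * z`, so rescaling `w` by any positive constant leaves it unchanged. Whether this matches the paper's equations term by term would need to be checked against lines 110 to 142.
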
*significance*: Having powerful methods available on low-end hardware, or even in hardware implementations, will probably become increasingly important in a world of "smart" tools. Stochasticity has already proven useful for uncertainty estimation, regularisation, etc. This paper effectively enables these tools in binarised networks, which are much easier to implement in hardware. The benefit of inherent normalisation with this method makes this an especially interesting approach.