Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
After reading the rebuttal and the other reviews, my score stays the same. Please add the discussed clarifications to the paper.
========
Overall I liked the submission, mostly for the well thought-out experiments that highlight an interesting and useful phenomenon: the linear scaling rule for batch size extends farther when momentum or preconditioning is used. The theory also led to intuitive, well-presented results that were predictive of the experimental behavior. This is a new, significant result for the community, and overall I recommend acceptance. There are a couple of changes that would substantially improve the paper.
First, in trying to motivate their co-diagonalization assumptions, as well as their theoretical assumption that they precondition by a power of the Hessian, the authors frequently conflate the Hessian and Fisher matrices. This conflation also appears in the study of the eigenvalues of the Fisher (in the appendix but referenced in the main text), which they pass off as eigenvalues of the Hessian. As a result, some conclusions are misleading as currently stated. This would be remedied by formally defining each matrix and changing the language around the two terms to make it clearer that they are different. (To be clear: I think it is perfectly fine to make the assumptions the authors make here, since they do not rely upon the results theoretically, but rather treat them as predictions of what may, and apparently does, happen in practice.)
Second, the authors should be clearer about the precise forms of the algorithms they study. In particular, the authors state and prove results regarding momentum without actually stating what SGD with momentum is. Yes, this is common knowledge, but it is still imperative to have it in the paper so that the reader can follow precisely what you are doing. The same goes for the later experiments where preconditioning and momentum are studied together: without an explicit algorithm, it is impossible to know, e.g., in which order momentum and the preconditioner are applied.
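To illustrate why the ordering matters, here is a minimal sketch, not the authors' algorithm. It assumes heavy-ball momentum and a diagonal preconditioner on a diagonal quadratic; the function names, the step sizes, and the time-varying preconditioner `varying` are all hypothetical. With a fixed preconditioner the two orderings coincide, but once the preconditioner changes between steps (as it does in Adam-style methods) they produce different iterates.

```python
def grad(theta, h):
    """Gradient of the diagonal quadratic 0.5 * sum(h_i * theta_i**2)."""
    return [hi * ti for hi, ti in zip(h, theta)]


def precondition_then_momentum(theta, m, h, precond, lr, beta):
    """Ordering A: precondition the gradient, then accumulate momentum."""
    g = grad(theta, h)
    m = [beta * mi + gi / pi for mi, gi, pi in zip(m, g, precond)]
    theta = [ti - lr * mi for ti, mi in zip(theta, m)]
    return theta, m


def momentum_then_precondition(theta, m, h, precond, lr, beta):
    """Ordering B: accumulate momentum on raw gradients, then precondition."""
    g = grad(theta, h)
    m = [beta * mi + gi for mi, gi in zip(m, g)]
    theta = [ti - lr * mi / pi for ti, mi, pi in zip(theta, m, precond)]
    return theta, m


def run(step, preconds, h=(1.0, 0.1), lr=0.5, beta=0.9):
    """Apply `step` once per entry of `preconds`, starting from theta = (1, 1)."""
    theta, m = [1.0, 1.0], [0.0, 0.0]
    for precond in preconds:
        theta, m = step(theta, m, h, precond, lr, beta)
    return theta


# Fixed preconditioner P = H^p with p = 0.5 (diagonal entries h_i ** 0.5).
fixed = [[1.0, 0.1 ** 0.5]] * 5
# Hypothetical time-varying preconditioner, standing in for an adaptive method.
varying = [[1.0 + 0.5 * t, 1.0] for t in range(5)]

a_fixed = run(precondition_then_momentum, fixed)
b_fixed = run(momentum_then_precondition, fixed)
a_var = run(precondition_then_momentum, varying)
b_var = run(momentum_then_precondition, varying)
```

Stating which of these orderings (or some other variant) was used in the experiments would resolve the ambiguity.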
This paper studies how the critical batch size changes based on properties of the optimization algorithm, including SGD, momentum, and preconditioning. Theoretically, the authors analyze the effect of batch size and learning rate via a simple quadratic model. Empirically, the authors run deep learning experiments and confirm the theoretical findings obtained for the simplified model. The paper is clearly written. The theoretical developments are based on a simple diagonalized quadratic model.
1. Regarding the proposed quadratic model, as the Hessian H is diagonal, the loss can be decomposed dimension-wise and therefore the optimization in each dimension evolves independently. These formulations are very restrictive and far from the practice of deep learning models. Can the authors comment on the generalizability of the analysis to non-diagonal quadratic models or even simple neural network models?
2. Regarding the optimizers, the authors add Gaussian noise to the gradients to model the effect of stochastic sampling. This noise model is fixed throughout optimization and has a diagonal covariance matrix, which differs from SGD, whose stochastic noise evolves during training and whose covariance matrix can be non-diagonal. Also, the authors claim that Adam can be viewed as a preconditioned SGD, but the preconditioning matrix considered takes the very special form H^p. While all these simplifications can lead to an analytical understanding of the optimization, they do not necessarily cover the practice of training deep models.
3. Overall I think the theoretical contribution of this paper is limited. The authors try to explain and understand the optimization mechanism of deep learning by studying a simplified quadratic model. The deep learning scenario violates many of the assumptions in this paper, e.g., diagonal Hessian, independent optimization among dimensions, fixed noise, etc. While the theoretical results of the quadratic model fit (to some extent) the empirical observations in training deep models, there is no good justification for a solid connection between these two models. It would be better if the authors could justify (to some extent) the motivation for simplifying deep learning scenarios into such a simple one.
I have read the authors' response. It addresses my concerns about the diagonal assumption on the Hessian of the NQM. Overall, I think this is an interesting paper that tries to model the relationship among the optimal choices of hyper-parameters for different optimizers in training neural networks. I am still a bit concerned about the use of a convex quadratic model and its generality. I raised my score to 6 (marginally above the threshold).
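For concreteness, the dimension-wise decoupling discussed in point 1 can be sketched as follows. This is an illustrative reconstruction rather than the paper's code: it assumes plain SGD on the loss 0.5 * sum(h_i * theta_i^2), with independent zero-mean gradient noise of variance c_i / B in dimension i for batch size B. The 1/B scaling, the symbols `h` and `c`, and all numerical values are assumptions. Under these assumptions, the second moment of each coordinate follows a scalar recursion that does not involve any other coordinate.

```python
def expected_loss_sgd(h, c, lr, batch_size, steps, theta0_sq):
    """Expected loss after `steps` SGD updates on a diagonal noisy quadratic.

    Each coordinate updates as theta <- theta - lr * (h * theta + eps), with
    eps zero-mean and of variance c / batch_size (assumed), so that
    E[theta^2] <- (1 - lr * h)^2 * E[theta^2] + lr^2 * c / batch_size.
    """
    second_moment = list(theta0_sq)
    for _ in range(steps):
        second_moment = [
            (1.0 - lr * hi) ** 2 * si + lr ** 2 * ci / batch_size
            for hi, ci, si in zip(h, c, second_moment)
        ]
    return 0.5 * sum(hi * si for hi, si in zip(h, second_moment))


h = [1.0, 0.1, 0.01]          # hypothetical curvatures (diagonal of H)
c = [1.0, 0.1, 0.01]          # hypothetical per-dimension noise variances
theta0_sq = [1.0, 1.0, 1.0]   # initial squared distance per dimension

small = expected_loss_sgd(h, c, lr=0.5, batch_size=1, steps=2000, theta0_sq=theta0_sq)
large = expected_loss_sgd(h, c, lr=0.5, batch_size=64, steps=2000, theta0_sq=theta0_sq)

# Steady state of the recursion above: lr * c_i / (2 * B * (2 - lr * h_i)) per dimension.
steady_small = sum(0.5 * ci / (2 * 1 * (2 - 0.5 * hi)) for hi, ci in zip(h, c))
```

The recursion makes the concern explicit: nothing couples one coordinate to another, and the noise level never changes during optimization, which is exactly where a real network may differ.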
[Edit after the author feedback]: I thank the authors for addressing my comments during the author feedback. I have read the authors' response as well as the other reviews. The authors' response addresses my concerns about the simplicity of the NQM. Overall, I think this submission is interesting and provides a different direction for understanding neural network optimization. I am happy to raise my rating.
==========================================================
Summary: Motivated by various recently observed batch size phenomena in neural network training, this paper proposes a simple noisy quadratic model (NQM) to capture and predict the behavior of several optimization algorithms at different batch sizes. The experimental results demonstrate that the NQM's predictions hold on image classification tasks and a language modeling task. The experimental results are also consistent with previous studies.
Pros:
- The proposed model is well aligned with various phenomena in deep neural network training, and could be an interesting direction for studying the role of batch size in training deep neural nets.
- Empirically, this paper provides a detailed and important analysis of the connection between the NQM and different optimization algorithms.
- The experiments are extensive, clear, and well designed, and they characterize the key features of the batch size effect.
Limitation:
- Although the proposed NQM agrees well with previous empirical results (i.e., Goyal et al., Shallue et al.), the objective function in (1) is convex and the quadratic form is diagonal, so I think the model is not powerful enough to explain those phenomena in neural network training.
Questions:
- As shown in Eq (1), the objective function is very simple. Could the authors provide more explanation to justify the problem setup, given that the problem is much easier than training deep neural networks in real applications?
- L95, 'we assume without loss of generality that the quadratic form is diagonal'. Could you elaborate on the diagonal quadratic form?
There is an omission in the related work on large batch training: https://arxiv.org/abs/1709.05011 https://arxiv.org/abs/1904.00962
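On the question above about the diagonal quadratic form, the kind of argument I would expect the authors to spell out is the standard change of basis below. It is a sketch under assumptions, not the paper's derivation: the symbols Q, \Lambda, \phi, \epsilon_t, C and \alpha are introduced here for illustration, H is assumed symmetric, and the update is plain SGD with additive gradient noise of covariance C.

```latex
% Assumed setup: H symmetric, gradient noise \epsilon_t with covariance C.
\begin{align*}
L(\theta) &= \tfrac{1}{2}\,\theta^\top H \theta,
  \qquad H = Q \Lambda Q^\top, \quad Q^\top Q = I, \\
\phi := Q^\top \theta \;&\Rightarrow\;
  L = \tfrac{1}{2}\,\phi^\top \Lambda \phi
    = \tfrac{1}{2} \sum_i \lambda_i \phi_i^2, \\
\theta_{t+1} = \theta_t - \alpha\,(H \theta_t + \epsilon_t)
  \;&\Rightarrow\;
  \phi_{t+1} = \phi_t - \alpha\,(\Lambda \phi_t + Q^\top \epsilon_t), \\
\operatorname{Cov}(Q^\top \epsilon_t) &= Q^\top C Q .
\end{align*}
```

The rotated noise covariance Q^\top C Q is diagonal exactly when the columns of Q are also eigenvectors of C, so "without loss of generality" holds for the curvature but requires the co-diagonalization assumption for the noise. Optimizers whose updates are not rotation-equivariant (e.g., coordinate-wise adaptive methods) would need a separate justification.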