NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Originality: While many works have studied the properties of the endpoint found by SGD, the literature looking at SGD training dynamics in the context of deep neural networks is sparser, and the loss contribution metric appears novel to me. The paper is therefore original in that respect.

Quality: The paper is in general of good quality. However, a few specific points could be improved:
- It would be nice to characterize the approximation error introduced by the first-order Taylor expansion.
- The authors claim that the loss contribution is well grounded, while other Fisher-information-based metrics depend heavily on the chosen parametrization. Could the authors expand on this point and provide a more detailed comparison between LC and the metrics introduced in [1] and [13]?
- In the introduction, the authors claim that entire layers drift in the wrong direction during training. However, in the corresponding section, this observation only seems to apply to the CIFAR/MNIST ResNet. It would be nice to characterize how robust this observation is.

Clarity: The paper is clear and enjoyable to read.

Significance: Looking at the training dynamics is an important topic. The authors propose a new metric to study the dynamics and provide nice empirical observations (learning is noisy, some layers sometimes drift in the wrong direction, learning is synchronized between layers). However, it is not clear how significant the loss contribution metric is. My main concerns are:
- How does LC compare with previously defined metrics [1, 13]? In which cases is LC more informative?
- Is LC informative of the quality of the final point found by the optimizer? LC takes the dot product between the update and the batch gradient (see the sketch after this review). It is not clear to me why using the batch gradient is a sensible thing to do. In particular, would networks trained with full-batch gradient descent have higher LC? We know that in practice, large-batch training finds solutions that exhibit worse generalization.
- Would it make sense to use the gradient computed on the validation set, to have a better estimate of the expected loss gradient instead of the empirical one?

Update: Thank you for your rebuttal, which did a convincing job regarding the utility of the metric with respect to the FIM. On the other hand, after discussion, I also agree with R2 that further empirical validation is required to ensure that the metric can find neurons that are "hurting/helping" during training, or more generally why LC is informative of the quality of the final point found by the optimizer. In light of the second review, I find that the paper is borderline and I decided to keep my score. I do think that the idea explored in this paper is an interesting one, and I encourage the authors to continue working in that direction.
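For concreteness, here is a minimal sketch of the quantity this review discusses, under the assumption that the per-parameter loss contribution at an iteration is the first-order term given by the negated product of the applied update with a reference gradient computed on the full training set or on a validation set. The helper names, the sign convention (positive meaning the parameter helped), and the per-batch averaging are assumptions for illustration, not the authors' implementation.

```python
import torch

def reference_gradient(model, loss_fn, data_loader, device="cpu"):
    """Per-batch average gradient over an entire loader (full training set
    or validation set). Hypothetical helper, for illustration only."""
    grads = [torch.zeros_like(p) for p in model.parameters()]
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x.to(device)), y.to(device)).backward()
        for g, p in zip(grads, model.parameters()):
            g += p.grad.detach()
        n_batches += 1
    return [g / n_batches for g in grads]

def loss_contribution(params_before, params_after, ref_grads):
    """First-order per-parameter contribution to loss reduction.
    Assumed sign convention: positive = helped, negative = hurt,
    matching the description in the reviews."""
    return [-(after - before) * g
            for before, after, g in zip(params_before, params_after, ref_grads)]

# Hypothetical usage around one optimizer step:
# params_before = [p.detach().clone() for p in model.parameters()]
# ref_grads = reference_gradient(model, loss_fn, full_or_val_loader)
# ... recompute the minibatch gradient, then optimizer.step() ...
# lc = loss_contribution(params_before, list(model.parameters()), ref_grads)
```

Swapping `full_or_val_loader` for a validation loader would give the variant asked about in the last bullet above (using a validation-set gradient as the reference instead of the empirical batch gradient).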
Reviewer 2
Efforts to shed light on the black-box optimization process of large over-parameterized networks address an interesting and challenging topic. The authors investigate this problem by analyzing the per-parameter contribution to any loss function reduction. The paper is well written overall and makes a few new and interesting observations. There are, however, some major concerns I have with the paper that I detail next.

- I think that the "help" or "hurt" heuristic is too simplistic and myopic. The authors claim that if the contribution of a parameter at any given iteration is negative, then the parameter is said to have hurt the loss reduction. While this is true in the mathematical sense, I feel that this is too simplistic. A local increase in the loss may lead to an eventual (greater) decrease contributed by the same parameter. By only looking at individual iteration snapshots, the authors have no way of accounting for such an effect. If the authors instead chose, say, an RMS value of the contribution, the interpretation of the term would be less myopic. Along the same lines, it might also be interesting to investigate the nodes that are maximally hurtful/helpful. Problems such as dead ReLUs (stemming from poor initialization) or connections to infrequent features might disable neurons (and hence, paths), and it would be interesting to see if the authors' approach could discover these.
- As a thought experiment to verify the above claim, I suggest the following experiment (see the sketch after this review). At any given iteration, take a learning step and compute the sets of helpful, hurtful and indifferent parameters. Then, undo this iteration and take the same step, but only for the variables that were helpful. If second-order effects are not present and this myopic view is true, then such an experiment should yield outcomes similar to the original training trajectory (or better).
- The approach leaves curvature out of the discussion. Rather than using only the first-order terms, maybe the authors could try using higher-order terms too? This may be expensive in some circumstances, but an approximation based on RMS values of the moments of the gradients (similar to what Adam/Adagrad maintain) might be worthwhile too.
- "Freezing the last layer results in significant improvement." is a known phenomenon; see "Fix Your Classifier: The Marginal Value of Training the Last Weight Layer" from ICLR 2018.
- The paper is missing actionable uses of the analysis. While a detailed analysis is enough for publication at NeurIPS, I feel that this paper is incomplete without some sample uses of the analysis. The authors discuss some possibilities in the last section (using LC for identifying over-fitting, or for architecture search) but stop short of presenting them. I highly encourage the authors to include exemplar uses of their analysis in the paper.

-- UPDATE --

I read the rebuttal and the other reviews, and I increase my score from 4 to 5. I feel that, especially in light of their rebuttal, a score of 4 is unfair to the authors. I wish to note that the authors' rebuttal was very well presented. However, I am still concerned about the metric. While it has benefits over the FIM, I feel that there is inadequate validation of whether this metric highlights what is intended. The authors do not discuss the experiment I suggested in my review to tease out this effect. Further, the claim about myopic vs. aggregate effects (Rebuttal #5) is not convincing. If the myopic and aggregate views are different, that requires a reconciliation with some of the other claims.
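A minimal sketch of the experiment suggested above, assuming the per-parameter contribution is the first-order term (negated update times a reference gradient, positive meaning "helped"). The function name `masked_sgd_step` and the sign convention are hypothetical and only illustrate the proposed procedure; this is not code from the paper or the rebuttal.

```python
import torch

@torch.no_grad()
def masked_sgd_step(model, ref_grads, lr):
    """Take the plain SGD update only for parameters whose first-order
    contribution is positive ("helped"); leave hurtful and indifferent
    parameters untouched. Assumes p.grad already holds the minibatch
    gradient and ref_grads holds the reference (e.g. full-batch) gradient."""
    for p, g_ref in zip(model.parameters(), ref_grads):
        if p.grad is None:
            continue
        update = -lr * p.grad                 # candidate SGD update
        contrib = -update * g_ref             # assumed sign: > 0 means "helped"
        p.add_(torch.where(contrib > 0, update, torch.zeros_like(update)))
```

Running two copies of the network from the same state, one with the plain optimizer step and one with `masked_sgd_step`, and comparing their trajectories would directly test whether the myopic, per-iteration view (i.e. negligible second-order effects) predicts training behavior, which is what the suggested experiment is meant to probe.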
Reviewer 3
The main contribution is the loss contribution metric. The metric is then applied to analysing deep neural networks. It is a challenging task to define a clear and interpretable metric that shows a new, surprising perspective on the training of deep networks, and the authors managed to do it. I believe that the experiments clearly demonstrate the utility of the metric, and the results are surprising. The paper is very well written. I was a bit let down that no novel practical tricks are presented (see below for details). Nevertheless, the paper will clearly be of interest to the community, and I am quite optimistic that future work will bring more practical applications of the developed metric.

Detailed comments

1. Showing oscillatory-like behavior in training is not very novel. "Walk with SGD" (https://arxiv.org/abs/1802.08770) and "On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length" (https://arxiv.org/abs/1807.05031) seem to already show a quite related dynamic in training. The first paper shows that the gradient oscillates (i.e., the cosine between gradients at subsequent iterations is negative; see the sketch after this review). The latter paper shows that there are directions in the weight space (corresponding to the largest eigenvalues of the Hessian) in which training is unstable. What is novel, I agree, is that such behavior happens at the parameter level, that it is as dominant as shown, and how parameters switch between helping and hurting. It would be nice to contextualize prior work a bit better.
2. Instability of the last layer was discussed by some prior work, e.g. https://openreview.net/forum?id=r14EOsCqKX. Freezing layers, especially the last one, is also not novel. In https://openreview.net/forum?id=r14EOsCqKX they also freeze the last layer.
3. The paragraph "Learning is heavy-tailed" could be made a bit more precise. For instance, how would refining the view of learning as a Wiener process alter the conclusions made by these papers? It wasn't very clear to me.
4. I would, though it is personal taste, remove the exclamation marks. I think using them is not the best practice in scientific writing.
5. Experiments show, on the example of the first layer, that freezing a layer that hurts might hurt even more, because the other layers then help less. This is not very intuitive. It also seems to limit the applicability of the developed metric (if the metric shows that a layer hurts, we do not know whether we should improve it or not). If possible, it would be nice to explain this result better.

Update

Thank you for the well-written rebuttal! I decided to keep my score. I would encourage the authors, in case the paper ends up rejected, to run the experiment suggested by one of the reviewers and examine the effect of skipping updates that are calculated to be negative, or any related experiment that would pinpoint the causal effect these dynamics have on training performance.
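For reference, the oscillation check mentioned in point 1 (a negative cosine between minibatch gradients at consecutive iterations, as reported in "Walk with SGD") can be monitored with a few lines. This is a generic sketch, not code from any of the cited papers, and the loop variables in the usage comment are hypothetical.

```python
import torch
import torch.nn.functional as F

def flat_grad(model):
    """Concatenate all parameter gradients into a single vector."""
    return torch.cat([p.grad.detach().reshape(-1)
                      for p in model.parameters() if p.grad is not None])

# Inside a standard training loop (sketch):
# loss.backward()
# g = flat_grad(model)
# if prev_g is not None:
#     cos = F.cosine_similarity(g, prev_g, dim=0).item()
#     # persistently negative values indicate the oscillation between
#     # consecutive steps described in "Walk with SGD"
# prev_g = g.clone()
# optimizer.step()
```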