NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
The result is not that surprising, but to the best of my knowledge its statement is novel, and it could be an interesting addition to the literature on implicit bias. It appears that the analysis mostly leverages the results of Soudry et al. [2018]: the authors apply an appropriate transformation to the AdaGrad iterates and relate them to "induced" GD iterates. Also, even though they are simple, I find the small examples in Section 3.3.1 quite enlightening. On the other hand, I am not so convinced about the value of Prop. 3.1 (especially after seeing the stringent assumption of Lemma A.6).
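To make the "induced GD iterates" reduction concrete, one common way to write the diagonal-AdaGrad update is (the notation here is assumed for illustration and need not match the paper's):

h_{t+1} = h_t + \nabla L(w_t) \odot \nabla L(w_t), \qquad w_{t+1} = w_t - \eta \, \mathrm{diag}(h_{t+1})^{-1/2} \nabla L(w_t).

If the accumulated squared gradients converge coordinatewise to some h_\infty, the late iterates approximately perform plain gradient descent in the rescaled variable u = \mathrm{diag}(h_\infty)^{1/4} w, which is the kind of transformation that allows the gradient-descent analysis of Soudry et al. [2018] to be brought to bear.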
Reviewer 2
The main contribution of the paper is to characterize the implicit bias of AdaGrad on linear classification problems with the logistic loss (and other losses). The authors show that AdaGrad converges to a direction that solves a quadratic optimization problem depending on the initial conditions, the hyperparameters, and the data itself, unlike gradient descent, which converges to the max-margin direction. They also give a few toy examples to demonstrate the properties of the AdaGrad solution and how it differs from the gradient descent direction (a numerical sketch of this contrast is given after this review).

Significance: It is an important problem to understand the implicit bias of various optimization algorithms on the solution, and there have been several recent works along this direction. The motivation comes from understanding the generalization abilities of overparametrized deep neural networks. It has been observed that the generalization abilities of deep neural networks trained with AdaGrad are not as good as with gradient descent, and hence understanding the exact form of the implicit bias of AdaGrad is an interesting and important problem.

Originality: This paper is the first to characterize the explicit direction to which AdaGrad converges for linear classification problems with the logistic or exponential loss. The framework, the assumptions, and some parts of the proofs are borrowed from Soudry et al. However, overall the proof has many new components compared to Soudry et al., and the authors proceed via a different argument in which they do not characterize the convergence rate, only an estimate of the direction of convergence. [1] also discusses the dependence of AdaGrad's convergence direction on the initial conditions and the hyperparameters. Can the authors please discuss this further in the related work section?

Quality: Overall, I find the paper well written. It would be nice to see more intuition for the proof of convergence.

A minor comment:
1) Appendix, page 4, line 43: \lambda is not defined, and \eta seems to be missing here?

[1] Gunasekar, Suriya, et al. "Characterizing implicit bias in terms of optimization geometry."

----------

I have read the authors' response and thank them for it. It would have been nice to include some empirical results as part of the paper, on somewhat larger datasets than those currently in the feedback document.
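The contrast described above can be probed with a minimal numerical sketch on an assumed toy dataset (three separable points in R^2, all with label +1); this is not the paper's experiment, and the step sizes and initialization below are arbitrary. The sketch runs logistic-loss gradient descent and diagonal AdaGrad and prints the normalized final iterates.

import numpy as np

# Minimal sketch, assuming toy data: compare the direction of the GD iterates
# with that of diagonal AdaGrad on the logistic loss. GD's direction should
# approach the max-margin direction; AdaGrad's direction may depend on the
# step size eta and the initialization w0.
X = np.array([[1.0, 0.2], [0.8, 2.0], [1.5, -0.3]])  # data points
y = np.ones(len(X))                                  # all labels +1 (separable by w = (1, 0))

def grad(w):
    # gradient of sum_n log(1 + exp(-y_n * w . x_n))
    margins = np.clip(y * (X @ w), -50.0, 50.0)      # clip to avoid overflow in exp
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0)

def run(adagrad, eta=0.1, steps=100000, w0=None):
    w = np.zeros(2) if w0 is None else np.array(w0, dtype=float)
    h = np.zeros(2)
    for _ in range(steps):
        g = grad(w)
        if adagrad:
            h += g * g                               # accumulate squared gradients
            w -= eta * g / (np.sqrt(h) + 1e-12)
        else:
            w -= eta * g
    return w / np.linalg.norm(w)                     # direction of the final iterate

print("GD direction:        ", run(adagrad=False))
print("AdaGrad direction:   ", run(adagrad=True))
print("AdaGrad, larger step:", run(adagrad=True, eta=1.0))

The printed directions are only meant to show how one would probe the dependence on eta and w0 described in the review; they are not a substitute for the paper's analysis.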
Reviewer 3
Originality: The results are original to the best of my knowledge.

Quality: Mostly looks OK. I only have an issue with Proposition 3.1. It is claimed that, with positive probability, property (i) of Lemma A.6 holds for any absolutely continuous distribution; I don't understand why this is true. To the best of my understanding, such distributions are simply distributions with a density function. However, such distributions can have bounded support (e.g., a uniform distribution over [0,1]), and therefore property (i) may have zero probability.

Clarity: Mostly OK, but I feel the authors tend to focus on special cases and don't give the complete picture, even when it's easy to explain/draw. For example, I was confused by the presentation of Example 3.1. It took me a while to understand the full solution: \tilde{w} points in the (sign(cos(\theta)), sign(sin(\theta))) direction, except when \theta points in one of the axis directions, in which case the solution is non-unique (this reading is restated formally after this review). If the authors add (or draw) this general result, it will be much easier to understand what is going on in Section 3.3.2.

Significance: Personally, the results do not seem very surprising to me, but it is also important to prove results which are not surprising, and perhaps the proof method has some novelty.

Minor comments:
1) line 81: "Lemma" -> "Theorem"
2) line 94: I guess it is assumed that y_n=0 w.l.o.g. from this point on? Later y_n appears again (e.g., before line 146), and it should be noted why.
3) line 118: what is the "decisive" part?
4) Example 3.1: Why use two points? Isn't one point enough?
5) line 151: what does "irrelevant to x_1" mean?

%%% Edited after rebuttal %%%

First, I thank the authors for clarifying Proposition 3.1, and I therefore increased my score as promised. The revised version seems correct now, but I think the authors overstate its significance in the rebuttal. First, it is not clear from the proposition, and the discussion around it, how small this positive probability is. The proof suggests to me that "generically" this probability vanishes exponentially with the number of dimensions and data points. Such a rarely occurring result does not seem to be very strong. Second, I think it is also somewhat inaccurate to say that "It negates the claim 'the implicit bias of AdaGrad does indeed depend on the initial conditions, including initialization and step size' on Page 8, Gunasekar, Suriya, et al. [2018a]." Specifically, I think the claim in Gunasekar et al. was not made for every dataset (as clearly it is not true with d=1), but rather that such a dependency can exist, in contrast to gradient descent (and indeed this seems to be the generic case in this paper). Still, I think it is a borderline paper, as the final results are somewhat incremental (there are no surprises or new exciting theoretical directions), the characterization of the implicit bias seems incomplete (it still depends on h_{\infty}, which we don't know), and the writing still requires polishing (as admitted by the authors).
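For concreteness, the reading of Example 3.1 described under "Clarity" above can be restated as a single formula (this is the reviewer's interpretation, with the sign assumed to be taken componentwise; it is not quoted from the paper):

\tilde{w} \propto (\mathrm{sign}(\cos\theta), \mathrm{sign}(\sin\theta)) \quad \text{whenever } \theta \text{ is not a multiple of } \pi/2,

with the limit direction non-unique when \theta lies on a coordinate axis, i.e., when one of the two signs vanishes.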