__ Summary and Contributions__: The paper proves that gradient descent converges to a population risk of O(OPT^{1/2}) + epsilon when learning a single neuron, under reasonable assumptions on the activation function. This generalizes previous results such as Isotron and GLMtron to the full agnostic setting and to the vanilla gradient descent algorithm. The main technical idea is to bound the objective by auxiliary functions.

__ Strengths__: Strong paper. It significantly improves our understanding of a long-standing problem with a substantial body of prior work.

__ Weaknesses__: none.

__ Correctness__: I verified only some of the proofs and the main proof sketch, not all of them. The results seem correct to me.

__ Clarity__: Very well written.

__ Relation to Prior Work__: clearly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper considers the problem of agnostically learning a single neuron with respect to the squared loss via gradient descent (GD). The focus is specifically on understanding the guarantees that GD obtains.
Under only boundedness of the input distribution, the authors show that for strictly increasing activation functions (derivative bounded below by a constant), GD finds a point achieving error O(sqrt(opt)) + eps, where opt is the loss of the best-fitting neuron. They extend the result to ReLU under standard anti-concentration assumptions (similar to [1,2]). For ReLU, it is known that one can in fact obtain O(opt) + eps under essentially the same assumptions (see [1]) using an alternate algorithm.
Further, they show that in the mean-zero sub-Gaussian noise setting, the guarantee improves to opt + eps. Note that it is known that in this setting there is an algorithm (GLMtron [3]) which gives the same guarantee under weaker assumptions.
[1] Diakonikolas, I., Goel, S., Karmalkar, S., Klivans, A.R. and Soltanolkotabi, M., 2020. Approximation Schemes for ReLU Regression. In Conference on Learning Theory.
[2] Diakonikolas, I., Kontonis, V., Tzamos, C. and Zarifis, N., 2020. Learning Halfspaces with Massart Noise Under Structured Distributions. In Conference on Learning Theory.
[3] Kakade, S.M., Kanade, V., Shamir, O. and Kalai, A., 2011. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems (pp. 927-935).
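For concreteness, the GLMtron update of [3] resembles the GD update on the squared loss but drops the activation-derivative factor. Below is a minimal numpy sketch on assumed toy noisy-teacher data; all names and the choice of sigmoid (as an example of a 1-Lipschitz increasing activation) are illustrative, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glmtron(X, y, act=sigmoid, steps=300):
    """GLMtron iteration: w <- w - mean_i (act(x_i . w) - y_i) x_i.

    Unlike GD on the squared loss, the update carries no act'(x_i . w)
    factor; this is the feature that underlies its opt + eps guarantee.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= ((act(X @ w) - y) @ X) / X.shape[0]
    return w

# Toy noisy-teacher data: y = act(x . w*) + mean-zero noise, rows of X ~ unit norm.
rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = np.ones(d)
y = sigmoid(X @ w_star) + 0.05 * rng.normal(size=n)
w_hat = glmtron(X, y)
```

In the noisy-teacher setting sketched above, the recovered `w_hat` drives the squared error down to roughly the noise level.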

__ Strengths__: The paper furthers the study of learning guarantees for vanilla gradient descent in the context of neural networks. Prior to this work, the known guarantees for GD were mostly restricted to the realizable setting and did not cover the challenging agnostic setting. Even though the results are obtainable by other algorithms [1,3], I think it is valuable to know the guarantees obtained via GD, since it is the algorithm of choice for training NNs.

__ Weaknesses__: The bounds the authors obtain for GD are not as good as those known from prior work, especially in the ReLU setup. The paper does not include a lower bound showing that its results are tight, so it is not clear whether the gap is due to looseness in the analysis or to a more fundamental barrier. It would have been great if the authors had explored this direction a bit further.

__ Correctness__: The proofs seem correct to me. I have checked the proofs in the main text but have not verified all the proofs in the appendix, so I may have missed some small detail.

__ Clarity__: The paper is generally well-written and easy to follow.

__ Relation to Prior Work__: The authors do a commendable job of putting their work in context with prior known results. They have consolidated the known results for learning a single neuron under varying noise models as well as distributional assumptions. I believe this is useful for anyone hoping to work in this area. I also appreciate the honesty of mentioning when prior work gets stronger guarantees.
One important work that the authors have missed is [1]. Since it is especially relevant to the paper, the authors should add a comparison and discussion.

__ Reproducibility__: Yes

__ Additional Feedback__: Minor typos:
- In equations (3.3) and (3.4), "v" is not defined, which makes them a bit confusing.
- Also, the proof of Claim 3.4 is missing.
[Post Author Response]
My score remains unchanged. The bounds in the paper are not optimal; however, I still think the work makes good progress toward our understanding of GD.
I believe the results of Diakonikolas et al. can be extended to the strictly increasing setting in a straightforward manner. I do agree that the authors' result holds under a more general distributional setting, though it achieves a weaker guarantee. As far as the dimension dependence goes, I think this is an artifact of the norm of the input (the paper assumes norm 1, whereas for a Gaussian it would be sqrt{d}).

__ Summary and Contributions__: This paper considers learning a single neuron using gradient descent. If the optimal solution achieves a population risk of OPT, then gradient descent roughly achieves a population risk of sqrt{OPT} in the agnostic setting (with no assumptions on the relationship between features and labels), and OPT in the noisy teacher network setting.
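The setting can be illustrated with a short numpy sketch of vanilla GD on the empirical squared loss of a single ReLU neuron. This is a toy under assumed synthetic noisy-teacher data (names, step size, and data are illustrative, not the paper's):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gd_single_neuron(X, y, lr=0.2, steps=1000):
    """Vanilla GD on L(w) = (1/2n) sum_i (relu(x_i . w) - y_i)^2."""
    rng = np.random.default_rng(1)
    # Small random init; w = 0 is a stationary point for ReLU.
    w = 0.1 * rng.normal(size=X.shape[1])
    for _ in range(steps):
        z = X @ w
        # (Sub)gradient of the loss: relu'(z) = 1 on {z > 0}.
        w -= lr * (((relu(z) - y) * (z > 0)) @ X) / X.shape[0]
    return w

# Noisy-teacher data: y = relu(x . w*) + mean-zero noise, rows of X ~ unit norm.
rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = np.ones(d)
y = relu(X @ w_star) + 0.05 * rng.normal(size=n)
w_hat = gd_single_neuron(X, y)
```

On this benign Gaussian toy data the iterates approach the teacher `w_star`, consistent with the OPT-level guarantee the paper proves for the noisy-teacher setting.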

__ Strengths__: This paper provides a thorough solution to the problem of learning a single neuron with gradient descent. The proofs are clean and contain some interesting ideas: for example, that the gradient points in a good direction until a low-risk point is reached.

__ Weaknesses__: The main limitation is that only one neuron is allowed. It would be more interesting if the analysis could handle more complicated models, such as two-layer networks.

__ Correctness__: I do not see any correctness issue.

__ Clarity__: I think the paper is well-written.

__ Relation to Prior Work__: The discussion of prior work is thorough as far as I know.

__ Reproducibility__: Yes

__ Additional Feedback__: Reply to the feedback:
Thanks for the response! I agree that agnostic learning of a single neuron is an interesting and important problem, and this paper provides a thorough analysis of it. On the other hand, while I agree it is natural to start from the single-neuron case, I am not sure whether the techniques in this setting can help us analyze the multi-neuron case: for example, can eq. (3.6) be generalized? It would be good to include some discussion.

__ Summary and Contributions__: The paper analyzes the complexity of learning a single neuron via gradient descent, in both the agnostic setting and the noisy teacher network setting. The main contribution is the first such result for agnostic learning of a single neuron.

__ Strengths__: The paper is very clearly written and presents a novel result in the agnostic learning setting.

__ Weaknesses__: The significance is somewhat limited by the restriction to a single neuron, since much of the difficulty of learning a neural network lies in jointly learning the mixture weights with the neurons.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: In general, a very well-written paper that is very clear in introducing the problem as well as explaining the main proof strategy.
The main contribution is in the agnostic learning setting, since the noisy teacher setting is very similar to the GLM setting.
==========================================
After reading the other reviews and the authors' rebuttal, I maintain the same rating.