This is a very nice paper showing that gradient descent on a two-layer network with a carefully regularized hinge loss can learn certain sparse parity problems with sample complexity that is, in a certain sense, exponentially better than that of linear methods (including RKHS and NTK methods, after converting to finite width). The reviews and discussion were all eventually in favor, and I personally really enjoyed this paper and its proof, which I studied in some detail and plan to investigate further.

---

Minor comments.

(a) It would be useful to include a corollary with an explicit choice or range of q, combining the upper and lower bounds into a single clear, compelling separation. Personally, to understand things, I chose q = k^18 and n >= k^40, which I think is implied as natural by the upper bound (it makes the bound roughly 1/k). While these exponents are large, stating them explicitly would help the story for many readers. You could even just write "poly"...

(b) It would be worth spending some time/space explaining the distributional assumption, and what breaks down if it is removed.

(c) A reviewer asked about a proof detail, which you ignored in your (short) rebuttal. Ultimately, it was I who checked the proof and explained the step to the reviewer, not another reviewer. Please clarify such issues in the future; dealing with actual technical questions is IMO one of the good things about these rebuttals.
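
The corollary suggested in (a) could be stated schematically along the following lines. This is only a sketch: the exponents q = k^18 and n >= k^40 are the ones I chose above, the O(1/k) error form is my reading of the upper bound, and the phrasing of the lower-bound clause is an assumption that would need to be filled in from the paper's actual theorems.

```latex
% Sketch only; exponents taken from the comment above, error form and
% lower-bound phrasing are assumptions, not the paper's statement.
\begin{corollary}[sketch]
  Fix the sparsity $k$, and take $q = k^{18}$ and sample size
  $n \ge k^{40}$ (both polynomial in $k$). Then gradient descent on the
  regularized two-layer network attains test error $O(1/k)$, whereas the
  lower bound rules out comparable error for any linear method without
  exponentially more samples.
\end{corollary}
```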