Thank you for your submission to NeurIPS. This was a borderline paper, with three of the reviewers being positive (including one who went from negative to positive after the discussion period) and one remaining quite unconvinced. Having read through the paper myself, I agree with the overall assessment that it has both strong and weak aspects.

On the positive side, I found the proposed modeling approach genuinely interesting. The idea of parameterizing a kind of energy function (I know the authors don't use this notion, but I do feel it is warranted; it refers to the loss function that is ultimately minimized to find the modes) as the combination of the squared value of the function f plus the penalty on the function derivative away from zero is rather interesting, and I believe it is likely to have applications well beyond modal regression.

At the same time, I agree with some of the reviewers that the presentation and derivation of this function seem rather strange. The authors motivate finding points where f(x,y) = 0, but of course these aren't the actual points of interest, since the "true" function f will be discontinuous, and approximations will have a sharp positive slope away from the data so that the function can then have a zero with negative slope at all the modes. It seems much more natural to me to simply present the proposed loss function as a kind of energy function that obviates the need for negative sampling by means of the gradient penalty. Obviously, the authors should present this as they feel best suits the paper, but this point in particular caused a great deal of discussion amongst the reviewers, and was one of the chief reasons for the disagreement.
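To make the energy-function reading concrete, one possible form is sketched below. The name E, the weight \lambda, and the target slope of -1 are illustrative assumptions on my part, not notation taken from the paper:

```latex
% Hypothetical energy: squared value of f plus a penalty that pulls the
% y-derivative of f toward an assumed target slope of -1 (away from zero).
E(x, y) \;=\; f(x, y)^2 \;+\; \lambda \left( \frac{\partial f}{\partial y}(x, y) + 1 \right)^{2},
\qquad \lambda > 0 .

% Under this reading, the predicted modes for an input x are the local minimizers of E in y:
\hat{y}(x) \;\in\; \operatorname*{arg\,local\,min}_{y} \; E(x, y) .
```

Under this assumed form, a zero of f with positive slope still incurs a derivative penalty of at least \lambda, so only zeros with slope near -1 reach an energy close to zero. This is the sense in which minimizing the full loss, rather than solving f(x,y) = 0 alone, picks out the modes.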
I would strongly encourage the authors to consider this alternative rationale for why minimizing the full regularized loss, rather than f alone, is the more appropriate target, especially since the entire method is originally motivated by finding zeros of f, yet ultimately this is not what is done.

The other weakness of the paper is, frankly, that the presented results are underwhelming: it is unclear that the method performs any better than alternative approaches on the real data examples, and even these examples are quite limited and straightforward. That said, I believe the presented approach may still be valuable even with rather preliminary results. If there is any way to conduct additional experiments to validate the proposed method, especially on real problems with multi-modal response variables, this would greatly strengthen the paper.

Ultimately, I believe that the strengths noted above make the paper worth accepting, but I strongly encourage the authors to address all of the points above.