This paper proposes a modification to the way the critic is typically trained in DPG: the critic is optimized to minimize a bound on the error of the value gradient rather than the error of the value itself, with the targets computed using a learned model. R1, R3, and R4 highlight that the approach is interesting and well-motivated, with R1 in particular praising its insightfulness. Although R3 questioned the novelty of the method and R2 was unconvinced by some of the theoretical claims, I think the authors have done a good job of demonstrating empirically that their approach improves upon the baseline. Moreover, even if the idea is not completely new, there is value in fleshing it out and getting it to work in practice. I believe this paper will be of broad interest to both model-free and model-based researchers in the RL community, and I therefore recommend acceptance. However, for the camera-ready I ask the authors to please address the following points raised by R2 and R3: (1) be more explicit that Proposition 3.1 applies only to DPG and not to the general policy gradient case; (2) discuss the effect of inaccurate gradients from the learned model; and (3) temper the claims that optimizing the action-gradient is the right objective, in light of the practical difficulties encountered when doing so (namely, the need to include the TD error as a regularization term).
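
To make point (3) concrete, here is a rough sketch of what I understand the combined critic objective to look like; the symbols $g_{\text{model}}$, $y$, and the weight $\lambda$ are my own notation and may not match the authors' exact formulation:

\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(s,a)}\Big[\, \big\| \nabla_a Q_\theta(s,a) - g_{\text{model}}(s,a) \big\|_2^2 \;+\; \lambda \,\big( Q_\theta(s,a) - y \big)^2 \,\Big],
\]

where $g_{\text{model}}(s,a)$ denotes the action-gradient target computed from the learned model, $y$ is the usual TD target, and $\lambda$ weights the TD-error term that the authors found necessary as a regularizer in practice.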