This paper brings direct optimization to RL and proposes a novel way of computing a policy gradient using heuristic search (and a simulator that produces a stream of trajectories). The contribution is novel enough to be of interest to the NeurIPS community. We encourage the authors to clarify the experimental setup in their revision (which caused some confusion in the initial reviews) by explaining how the different algorithms use the simulator, and to make sure the comparison is fair in terms of the number of interactions with the environment.