The authors propose an RL-inspired way of fitting a conditional generative model to the training data, with the aim of generating discrete structures, such as molecules, that satisfy desired properties. Unlike policy-gradient methods in RL, the proposed algorithm does not require sampling from the model/policy; instead, it approximates the expectation of interest using the training data reweighted by normalized rewards. This is done to avoid the high gradient variance of policy-gradient estimators.

The reviewers appreciated the novelty of the approach to this important problem. While the experimental results are not spectacular, and there were concerns about missing RL baselines and about the connection to reward-augmented maximum likelihood, the author response addressed these concerns in large part.

One high-level question that needs to be discussed in the paper is whether the proposed method truly optimizes the RL objective, as claimed, or whether it is essentially a weighted maximum-likelihood method. For example, unlike a true RL method, the proposed approach seems incapable of discovering genuinely new structures: since it never samples from the model, it cannot stumble upon them during training. The proposed entropy penalty might merely smooth the predictions around the training data rather than lead to the discovery of novel structures.
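To make the distinction concrete, the update the review describes can be sketched as weighted maximum likelihood: the expectation is taken over the fixed training set with reward-proportional weights, not over samples from the model. The following is a minimal illustrative sketch, not the paper's actual method; the toy setup (a categorical model over four discrete structures, the learning rate, and the reward values) is entirely assumed.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward_weighted_mle_step(logits, data, rewards, lr=0.5):
    """One reward-weighted maximum-likelihood step.

    Instead of sampling structures from the model (as a policy-gradient
    method would), the expectation is approximated on the fixed training
    data, with each example weighted by its normalized reward.
    """
    w = np.asarray(rewards, dtype=float)
    w = w / w.sum()                      # normalized rewards as weights
    p = softmax(logits)
    grad = np.zeros_like(logits)
    for x, wx in zip(data, w):
        onehot = np.zeros_like(logits)
        onehot[x] = 1.0
        grad += wx * (onehot - p)        # gradient of wx * log p(x)
    return logits + lr * grad

# Toy example: 4 possible discrete structures, all present in the data.
logits = np.zeros(4)
data = [0, 1, 2, 3]
rewards = [0.1, 0.1, 0.1, 5.0]           # structure 3 has high reward
for _ in range(50):
    logits = reward_weighted_mle_step(logits, data, rewards)
print(softmax(logits))                    # mass concentrates on structure 3
```

Note that the fixed point of this update is the reward-normalized empirical distribution over the training set: probability mass is redistributed among structures that already appear in the data, which is exactly why the review questions whether the method can discover structures outside it.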