This submission proposes a new bootstrapping optimization technique based on adding the log-policy to the immediate reward. The approach is shown to bring strong empirical gains, and the theoretical analysis helps explain why. Although the reviewers remained divided even after an active discussion period (7, 7, 5, 5), I believe this paper is worth publishing at NeurIPS. Simple ideas that bring significant improvements, like this one, are typically the most impactful. I also appreciate the effort made to better understand the theoretical properties of the proposed algorithm, beyond the basic intuition. The main remaining concerns of R4 and R5 related to the significance of the theoretical results. However, in my opinion, their criticism was not specific enough to confidently invalidate these results or to dispute their novelty. Consequently, I am recommending acceptance, aligning myself with R1 and R2.