NeurIPS 2020

Off-Policy Interval Estimation with Lipschitz Value Iteration

Meta Review

This paper provides a novel strategy for obtaining upper and lower bounds on action-values, under the assumptions that the action-values are Lipschitz and that the environment is deterministic. The idea is novel, but a few key changes would significantly improve the paper.

1. Reviewers recommended making less strong claims about issues with previous approaches, since determinism can be exploited here. In the response, you said you would contrast this more clearly, and I highly recommend spending a reasonable amount of time explaining how your new approach differs from the more common (and well-understood) statistical approaches like bootstrapping.

2. The restriction to deterministic problems is acceptable, but a discussion of extensions to the stochastic setting would make the contribution stronger.

3. Some of the notation needs to be improved. As mentioned by a reviewer, R[Q] is odd. We were also confused in some places where Bpi is used interchangeably for the true Bellman operator and for the one based only on observed data. This operator should not be overloaded, and it might be better to call the one based on data Bhatpi, or even Bpi_n where n is the number of samples.
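
To illustrate the notation point, one possible way to separate the two operators is sketched below. This is only a suggestion, not the paper's notation: the reward r, discount factor \gamma, transition kernel P, and the observed transitions (s_i, a_i, r_i, s_i') are assumed symbols introduced here for the example.

```latex
\begin{align*}
% True Bellman operator: defined for every state-action pair (s, a)
(\mathcal{B}^{\pi} Q)(s, a)
  &= r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')}
     \big[ Q(s', a') \big], \\
% Empirical operator: defined only at the n observed transitions
(\hat{\mathcal{B}}^{\pi}_{n} Q)(s_i, a_i)
  &= r_i + \gamma \, \mathbb{E}_{a' \sim \pi(\cdot \mid s_i')}
     \big[ Q(s_i', a') \big], \qquad i = 1, \dots, n.
\end{align*}
```

In a deterministic environment the two coincide at the observed pairs (s_i, a_i), so the distinction that matters is the domain: the empirical operator says nothing about unobserved pairs, and a separate symbol makes that explicit.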