Reviews: Off-Policy Evaluation via Off-Policy Classification

Originality: As stated in the contributions, the paper presents an approach to OPE that is a departure from classic approaches. As far as I am aware this is a very novel way to address this issue Clarity: I found the paper to somewhat lack structure, which hurts readability and clarity: - The description of the positive unlabeled learning and its application to OPE (section 3) is rather convoluted and not always easy to follow. This section could use some intuitive examples and a more detailed step-by-step development. - Overall structure: Section 3 ends with short disconnected section 3.3 on metrics, then moves on to a largely unrelated section 4 motivating applicability, before section 5 discusses related work and more metrics. Significance & Quality: The paper addresses an important problem using a novel approach. It will hopefully inspire new approaches to the problem. The method as presented here do have some important limitations: - The method is restricted to deterministic systems with a single binary success reward per episode. As noted by the authors this setting is quite restrictive. Even though many problems may be interpreted as binary success problems, the inability to deal with stochasticity in the problem may have a large impact on applicability. It would also be good to test how important deterministic dynamics are to the performance of the method in practice. - The approach seems to simply ignore one of the key issues in OPE, i.e. the distribution mismatch in the data. In section 3.1, the authors note that they simply assume that the classification error is a good enough proxy and that it seems to be robust in their evaluations. This assumption is not sufficiently tested. It would be good to explicitly evaluate robustness in the Tree example and explicitly evaluate how / if the method degrades when the data is generated by policies that are not very broad/random or far away from the target policy. - Unlike existing OPE approaches, the method does not seem to offer strong theoretical guarantees / bounds on performance. - It was not clear to me how data efficient the method is. Lack of data is a key issue in the settings considered in the paper (i.e. where no simulation for the real test problem is possible), so it would be good to see how the performance depends on the amount of data.

Originality: The work seems novel, introducing both a new metric for OPPE, and a new way to benchmark deep RL algorithms Quality: Given the assumptions provided, the proofs all seem correct, and the method seems to work in the settings advertised. Clarity: The proofs and arguments are all easy to follow. It was hard for me to understand how the correlation coefficient, and I hope I reached the right understanding. Perhaps a more detailed series of steps as I outline below would help clarify the steps needed. Significance: If this is more generalizable to less restrictive assumptions, it could be a very novel way to do OPPE, and might advance the whole domain. L72: Why does the \gamma = 1 restriction exist? It doesn't seem to be used in the proofs, and the experiments seem to have bene done with \gamma != 1. The assumption that the transition matrix P being deterministic does not seem like a good assumption for many real world problems, although it's likely not a bad one for the specific robot grasping task. For example: due to imprecision in hardware, the same motor command in the same state twice will probably lead to a set of states. Specifically in the tasks chosen, there is very little stochasticity - the robotic hand would probably end up in a close enough position to disregard the stochasticity. On complex real world tasks, you might see a significant amount of stochasticity, which hurts the argument of this paper. The restriction is therefore quite strong and seems somewhat core to the proofs. I do not see a simple way to extend the work to include stochastic transition matrices. It seems that the metric changes according to how the Q functions were trained. If I understand correctly, the measurement is this: - A few Q functions are trained according to different hyperparameter sets - They are all evaluated on policy to obtain the "true average return" - They are also evaluated on some off-policy data to obtain the metric values - The correlation between the true average return and metric values are obtained. This implies the evaluation of novel techniques in this framework depend heavily on the set of Q functions trained. Thus, it's hard for new works to reproduce the results here, since if the choice of algorithm changes, then so might the ranking among algorithms. I'm confused to why some of the baselines have a strong negative correlation with the true average return. I wonder if the authors can analyze this.

Paper ID:	2909
Title:	Off-Policy Evaluation via Off-Policy Classification

Reviewer 1

Reviewer 2

Reviewer 3