Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
We consider the problem of offline policy evaluation~(OPE) with Markov decision processes~(MDPs), where the goal is to estimate the utility of given decision-making policies based on static datasets. Recently, theoretical understanding of OPE has been rapidly advanced under (approximate) realizability assumptions, i.e., where the environments of interest are well approximated with the given hypothetical models. On the other hand, the OPE under unrealizability has not been well understood as much as in the realizable setting despite its importance in real-world applications.To address this issue, we study the behavior of a simple existing OPE method called the linear direct method~(DM) under the unrealizability. Consequently, we obtain an asymptotically exact characterization of the OPE error in a doubly robust form. Leveraging this result, we also establish the nonparametric consistency of the tile-coding estimators under quite mild assumptions.