NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 4404
Title: Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle

Reviewer 1

***Post rebuttal*** I read the other reviews and the author feedback. I appreciate the effort of the authors in addressing my issues. The authors managed to provide significant examples in which the performance gap is finite. Still, I believe that the assumption might be restrictive and that knowledge of the performance gap (or of a lower bound on it) is not straightforward. The authors clarified my confusion about Section 5. I am raising the score to 6.

I think that the paper provides nice contributions, although the resulting algorithm DMQ is quite complex and the authors did not completely succeed in explaining the general idea. Moreover, I have some doubts about the assumptions. I made a high-level check of the math and it seems ok. Overall, the paper is written in good English and reads well, although the organization of the contents can be improved. Here are my detailed comments.

***Major***
- Definition 3.1: I am fine with this definition when considering finite state spaces, but it seems to me that the minimum over the state space cannot be used so easily when passing to infinite state spaces. Indeed, it might be the case that such a minimum does not exist. In that case, it would be more correct to use the infimum. However, the infimum could be zero and, consequently, the bound of Theorem 6.1 would become vacuous. Are the authors assuming that the gap exists and is strictly greater than zero? It seems that this is the case, as stated in Section 7 (Regret Bound paragraph). If so, I suggest clearly stating the assumption sooner. How restrictive is this assumption? Moreover, $\gamma$ is used to define the hyperparameter $\epsilon_t$, which is used by the algorithm. Since $\gamma$ depends on the optimal value function itself, how can $\gamma$ be computed in practice? Are the authors assuming that $\gamma$ is known in advance? (See the sketch at the end of this review.)
- Section 5: the proposed algorithm turned out to be quite elaborate (it is split into three pseudocodes, Algorithms 1-3). While Section 5 explains the functioning of the algorithm step by step, I found it quite hard to understand why certain choices were made. I think that the main problem with this section is that it lacks an overview of the general idea of the algorithm. For instance, neither the motivation behind the definition of the policy in Equation (1) nor the actual role of the oracles in the algorithm is immediately clear.

***Medium***
- The abstract seems a bit too long and dives into many details, like the explanation of the oracles.
- Section 1: the authors report in the Introduction the statement, although informal, of Theorem 1.1, which is the main contribution of the paper. I think that reporting it in the Introduction is premature; I suggest describing the meaning of the theorem without the statement.
- Oracle 4.1: the regularizer $\Lambda(f_1,f_2)$ is used in the definition but explained only later. I suggest anticipating the definition of $\Lambda$.

***Minor***
- line 146: "By Q-learning, we mean ..." this sentence is not very clear
- line 196-197: "does not rely any label information" this sentence is not very clear

***Typos***
- line 97: algorithm. which -> algorithm which
- line 216: choose -> chooses
- line 216: use -> uses
- line 238: in the definition of $M_1$ and $M_2$, the subscripts of $D$ ($1$ and $2$) should appear inside the absolute value
- line 272: Note we -> Note that we
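To make my point about Definition 3.1 concrete, here is the gap as I read it (my own reconstruction, not the authors' exact statement):

$$\gamma \;=\; \min_{s \in \mathcal{S}} \; \min_{a:\, Q^*(s,a) < V^*(s)} \big( V^*(s) - Q^*(s,a) \big).$$

For an infinite state space the outer minimum should be an infimum, and $\inf_{s}$ may well equal $0$ even when every individual gap is positive; since the bound of Theorem 6.1 scales polynomially in $1/\gamma$, it then becomes vacuous unless $\gamma > 0$ is assumed explicitly (and, as noted above, the algorithm additionally needs $\gamma$, or a positive lower bound on it, to set $\epsilon_t$).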

Reviewer 2

After rebuttal: I read the author response and other reviews. I'll keep my assessment.

---------------------------

This paper studies the Q-learning problem with linear function approximation and proposes the first provably efficient algorithm, DMQ, for the general stochastic setting. Melo and Ribeiro [2007] and Zou et al. [2019] are the most closely related works, but they assume that the exploration policy is fixed. This paper also introduces a DSEC oracle, which examines whether all functions perform well on both distributions D1 and D2. In the DMQ algorithm, the learned predictor will be near optimal when the DSEC oracle returns False. In addition, the DSEC oracle will only return True at most a polynomial number of times, thus avoiding exponential complexity. The DSEC oracle is novel and has a nice property. Overall, this paper is clearly written and theoretically sound.

Some comments:

The proposed DSEC oracle has an interesting property to me. By definition, it tests whether all functions in the function class work well under both distributions D1 and D2. In the DMQ algorithm, it will guarantee that the learned \hat{\theta} is close to the true \theta. (A rough sketch of my reading of the oracle is at the end of this review.)

One limitation of this work is the assumption that the true Q-function is exactly linear, which will not hold in general. Is it possible to extend to the approximate case? Will the algorithm also return a near-optimal policy when the best linear approximation is close to the true Q-function?

The sample complexity is polynomial in 1/\gamma, where \gamma is the suboptimality gap. If a few state-action pairs have zero or quite small gaps, but most state-action pairs have large gaps, can the sample complexity be improved?

In Algorithm 2, the agent needs to collect the on-the-go reward y, which has higher computational cost than collecting the reward r, since the agent needs to roll out to timestep H.

The proof of Theorem 6.1 is a little unclear to me. In lines 401 and 402, why do the labels y in D_h^a have bias bounded by \epsilon? Is it because the result holds for h'=h+1,...,H, so that the induction can be applied?

The \gamma notation denotes the suboptimality gap in this paper. Although it will not lead to confusion, I think it would be better to change to another notation, since \gamma is usually reserved for the discount factor.
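To record how I currently read Oracle 4.1 in the linear setting, a minimal sketch (this is my own reconstruction, not the authors' pseudocode; the thresholds eps1, eps2 and the ridge form of the regularizer $\Lambda(f_1,f_2) \approx \lambda\|\theta_1-\theta_2\|_2^2$ are my guesses): the oracle asks whether two linear predictors can agree on the features in D1 yet disagree on those in D2, which for squared error reduces to a generalized eigenvalue problem between the two empirical second-moment matrices.

import numpy as np

def dsec_oracle(D1, D2, eps1, eps2, lam):
    # D1, D2: (n, d) arrays of feature vectors phi(s, a) drawn from the two distributions.
    # Returns True if there exists v = theta1 - theta2 with
    #   mean over D1 of (x @ v)**2 + lam * ||v||^2 <= eps1   (the pair agrees on D1)
    # but
    #   mean over D2 of (x @ v)**2 > eps2                    (the pair disagrees on D2),
    # i.e. small error on D1 does not transfer to D2 (a distribution shift).
    d = D1.shape[1]
    M1 = D1.T @ D1 / len(D1) + lam * np.eye(d)   # regularized second moment of D1
    M2 = D2.T @ D2 / len(D2)                     # second moment of D2
    # max of v^T M2 v subject to v^T M1 v <= eps1 equals eps1 * lambda_max(M1^{-1} M2)
    ratio = np.max(np.real(np.linalg.eigvals(np.linalg.solve(M1, M2))))
    return ratio * eps1 > eps2

If the oracle returns False, every pair of predictors that fits D1 well also agrees on D2, which matches my reading of why the learned \hat{\theta} stays close to the true \theta as long as the oracle keeps returning False.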

Reviewer 3

Overall the paper is clear and well presented. The intuition of the algorithm is also explained. I only have minor comments on the results.

1. Can Theorem 6.1 be stated in a more general setting, maybe under some assumption on the error checking oracle, or with the number of times this oracle is called appearing in the bound?
2. As pointed out by the authors, the proposed algorithm is not optimal. This can be seen in a particular setting: stochastic bandits. It may also help the readers to better understand the algorithm by discussing its behaviour in a stochastic bandit setting. In particular, in order to identify the best arm, the procedure in Algorithm 2 (lines 6-8) is not optimal. Have the authors tried to improve this part? (See the note after this review.)
3. Can the results of Lemma A.4 be generalized to settings other than linear approximation?

==============

After rebuttal: I have read the rebuttal and other reviews. My score remains the same.
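To make the bandit comparison in point 2 concrete (my own back-of-the-envelope reasoning, not taken from the paper): with $H = 1$ the gap of Definition 3.1 reduces to the usual arm gap, $\gamma = \min_{a:\,\Delta_a > 0} \Delta_a$ with $\Delta_a = \mu^* - \mu_a$, so a sample complexity scaling with $\mathrm{poly}(1/\gamma)$ charges every arm for the single hardest suboptimal arm. Standard gap-dependent bandit algorithms (UCB, successive elimination) instead pull each suboptimal arm only about $\log(T)/\Delta_a^2$ times, giving regret of order

$$\sum_{a:\, \Delta_a > 0} \frac{\log T}{\Delta_a},$$

which is what I had in mind when asking whether the exploration in Algorithm 2, lines 6-8, can be sharpened.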