NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 9083 A Family of Robust Stochastic Operators for Reinforcement Learning

### Reviewer 1

SUMMARY: The paper considers the problem of designing a Bellman-like operator with certain properties: 1) Optimality preserving property: The greedy policy of the converged action-value function be the optimal policy. 2) Action-gap increasing property: The action-gap (the difference between the action-value function of the optimal action and suboptimal ones) increases. The motivation for the action-gap increasing property comes from the result of Farahmand  that shows that the distribution of the action-gap is a factor in the convergence to the optimal policy. Roughly speaking, when the action-gap is large, errors in estimating the action-value function Q becomes less important. The result is that we might converge to the optimal policy even though the estimated action-value function is far from the optimal one. Bellemare et al.  propose some operators that have these properties. The premise is that if the action-gap is increased by the use of an alternative operator that still has the optimality preserving property, the errors in the approximation of the action-value function becomes less important. These errors happens whenever we use a function approximator and finite data. One of the operators of Bellemare et al. , which is called the Advantage Learning operator, is defined as (T_AL Q)(x,a) = (T_B Q)(x,a) - beta [ V(x) - Q(x,a) ], where T_B is the conventional Bellman optimality operator and V(x) = max_a Q(x,a). In the work of , this beta is a constant. A main contribution of paper is the extension of T_AL by allowing beta to be a random variable that can change at every iteration: (T_{beta_k} Q_k)(x,a) = (T_B Q)(x,a) - beta_k [ V_k(x) - Q_k(x,a) ]. This generalizes the previous operator, and potentially adds extra flexibility in designing operators with better action-gap increasing property. Theorem 3.1 of this paper shows that this operator indeed has the desired properties (1) and (2) almost surely. The paper also provides a few other results. For example, Theorem 3.4 shows that if we have two sequences of beta_k and beta’_k with the same expectation (for a fixed k), but different variance (say, Var[beta_k] < Var[beta’_k]), the variance of the estimated action-value function would have the same ordering. As it can be shown that the action-value function of their greedy action is the same, this implies that their action-gaps are different and satisfying the same order relation. The paper empirically compare the conventional Bellman operator, consistent Bellman operator, and this new Robust Stochastic operator on a series of testbeds. EVALUATION: Originality: It is relatively original: The stochastic aspect of the new operator is novel; the general form of the operator is similar to the one introduced by Bellemare et al. . Quality: The paper comes with theoretical guarantees. I have a few questions about some steps of the proof. The empirical results are supportive, but I am not completely convinced about their statistical significance. Clarity: The paper is relatively well-written. Significance: Studying new action-gap increasing operators might have important consequences for how design RL algorithms. Detailed Comments and Questions: Understanding the action-gap and designing operators that possibly make the learning faster and more efficiently by increasing their action-gap is an important topic. The paper makes a contribution in that direction, and I think it is valuable. I have some issues with the proof of a theoretical result. I am also not completely convinced that the added stochasticity helps much. I have some question, which I hope the authors can clarify. I use [*], [**], etc. to show the importance of each comment, so that the authors can prioritize their answers. + [**] It is clear that the stochastic beta_k is more general than a constant one, but as far as I can see its benefit is not obvious. - [**] First of all, the empirical comparisons are done with the consistent Bellman operator, defined as (T_C Q)(x,a) = r(x,a) + gamma E[ Id{x \neq X’} max_{a’} Q(X’,a’) + Id{ x = X’} Q(X’,a) ], as opposed to perhaps more obvious case of Advantage Learning operator, which has a constant beta. I encourage the authors to compare with(T_AL Q), perhaps with a beta that is close to 1. - [**] Second, the as the authors mentioned the empirical results do not show that any distribution of beta performs better than simply choosing it from a uniform distribution with a support of [0,2). If no benefit has been observed in changing how beta_k is selected, how can we hope to find a maximally efficient operator by optimizing over the distribution of beta_k? Of course, it is possible that finding the right beta_k is a difficult problem, but a better ones exist nonetheless. + [***] The proof of Theorem 3.1 needs some clarifications. I am not sure about a few of its steps. - In Eq. (9), shouldn’t we have P[ \hat{Q}(x,a) - T_B \hat{Q}(x,a) > eps] = 0 instead of P[ \hat{Q}(x,a) - T_B \hat{Q}(x,pi^{b_k}) > eps] = 0? I believe we want it to hold for the same action a, and not comparing action a with pi^{b_k}? Compare this line of argument with the proof of Theorem 2, and especially Eq. (15), of Bellemare et al. (I am referring to this version: https://arxiv.org/abs/1512.04860 ). Or for the right equation of (9), shouldn’t we have P{\hat{V}(x) - T_B \hat{Q}(x,pi^{b_k}) < -eps} = 0? Also I think it helps a lot if you expand the arguments a bit in order to make them more precise and easy to understand. I believe the proofs of Bellemare et al.  have the right level of detail; the proofs here are more condensed. - [*] In the proof of Theorem 3.1, right after (8), it is argued that V_k(x) + f_k is monotone. I believe the authors meant that it is monotonically increasing. In that case, don’t we need to show the existence of an upper bound before claiming its convergence? - [**] Please provide more discussions regarding the proof of statement on LL421-424. Currently it is a bit unclear. + [*] In L150, it is mentioned that the new results shows that the optimization problem for finding a maximally efficient operators can be done in an infinite dimensional space of sequence of r.v. beta_k. This is not an accurate statement. If we optimize (beta_k) directly, they won’t be independent random variables, hence, the probabilistic arguments of this work would not hold anymore. It is, however, possible that the optimization is performed on the parameters of some probability distributions from which beta_k are independently sampled. + [*] What are the definitions of stochastic ordering and convex ordering symbols used in Theorems 3.2 and 3.3? I think these should be clarified in order to be useful for the audience of this conference. + [**] Are the empirical results statistically significant? For example, for the Cart Pole, the average scores are between 183-190, and their standard deviations are between 20-29. There appear to be a lot of overall between the scores. + [*] Using an alternative Bellman operator, such as the introduced one, changes the definition and semantics of the action-value function. It is not the value function of the original problem, but a quantity whose maximum over actions is still the same as the optimal value function (but not for other actions). As a result, an increased action-gap does not directly say anything about the action-gap of the original problem (which a property of the MDP and not the estimate). Therefore, the result of Farahmand  that shows the performance loss depends on the error in estimating the action-value function and the action-gap property of the MDP does not directly transfer to this modified action-value function and its corresponding action-gap. + [*] Typos and Other Minor Issues: - L82: This sentence is a bit awkward. - L109: “Let x’ always denotes” —> “Let x’ always denote”.

### Reviewer 2

Paper summary: The paper suggests replacing the Bellman operator in Q-learning with a sequence of stochastic operators, which in the limit, can preserve optimal actions while increasing the Q-value difference between optimal and suboptimal actions (thus reducing the impact of inaccurate value estimate). This sequence of operators add an advantage term to the Bellman operator, with random coefficients whose means are between 0 and 1. The proposed method, named RSO, is tested on 4 simple control tasks on OpenAI Gym. Using tabular Q-learning, RSO outperforms normal Q-learning by a significant margin. Strengths: 1. Novel design of new Bellman-like operators with better properties. 2. Detailed explanation of the experimental design. 3. Easy to read and to understand. Weaknesses: 1. Some experimental results are not convincing enough. For example, Bellman and consistent Bellman completely fail in Fig 1. That usually is due to lack of exploration. It is possible that RSO works better simply because it has more randomness. In addition, section 4.5 claims that the uniform \beta distribution is better than constant, but no numerical results are provided. Also in Fig 2 (b), the difference between RSO and consistent Bellman seems very small compared to the scale of rewards. 2. It is perhaps better to explain intuitively why RSO is optimality-preserving and action-gap-increasing based on equation (4). Then the reader wouldn't have to read through the proofs in the appendix before getting a rough understanding.

### Reviewer 3

I believe that the idea of stochastic Bellman operators is interesting (though a bit incremental, due to the presence of [5,12]) in the context of gap-action increasing Bellman operator. Though it is not super related, https://arxiv.org/pdf/1311.5918.pdf also studied random Bellman operators, and it might be proper to reference this work and discuss on relations. Although the idea is interesting I find this paper problematic in terms of its theoretical work. On the other hand, its empirical contribution is not important enough (in my view) to cover for the problematic theoretical work. Here are my remarks. - A suggestion : Think it is not proper to refer to the Bellman operators as 'Robust'. Although I understand the reason for author's choice 'robust' in the field of decision making usually refers to the case adversarial perturbations exists. - The authors use the terms stochastic ordering and convex order. It must be defined in the preliminaries, that's why preliminaries are for. -The proofs in the appendix are not written good enough in my opinion, they are quite sloppy in my opinion. Furthermore, I believe the authors require pessimistic initialization of the q function to make the proofs work. Here are my remarks on the proofs. (**) Remarks on the proof of Theorem 1. - In the proof, line 408, why T_{\beta_k} Q_k(x,a) \leq T_B Q_k(x,a)? shouldn't you assume that V_k(x)-Q_k(x,a)\geq 0 for that? if so, where that you prove it? - The third and forth relations of the passages below line 411 are not clear. - In the paragraph below line 417, why did you analyzed convergence in probability all of a sudden? and how, based on that, you returned and gave an a.s. convergence result? (**) Remarks on the proof of Theorem 3.2 -) Why do you need to consider any increasing f(\cdot)? why not explicitly show by induction what you wish to prove? Furthermore, I didn't understand why E[f(\hat{Q}_{k+1}] \geq E[f(\bar{Q}_{k+1}] results in \hat{Q}_{k} \geq \bar{Q}_{k}. -) In the last part of the proof the authors again use f(\cdot). What's the reason for that? in the Definition 3.2 you solely use the increasing advantage. (**) Theorem 3.3 and Theorem 3.4 written in a really sloppy manner in my opinion. It is not clear what the authors mean by saying `convex-order'. The proof of Theorem 3.4 is hardly stated- the induction is not specified (base case, induction hypothesis and induction step). Moreover, the authors does not explain why their claim holds. -) The experimental results are nice, but can hardly be seen as an important contribution in my opinion as they are quite basic and test a straightforward implication of the suggested approach. Furthermore, the authors did not prove that the q-learning version which they empirically test indeed converges. In the theoretical part, the authors analyzed a value-iteration like algorithm in which the *entire state space* is updated per iteration.