{"title": "A Family of Robust Stochastic Operators for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15652, "page_last": 15662, "abstract": "We consider a new family of stochastic operators for reinforcement learning with the goal of alleviating negative effects and becoming more robust to approximation or estimation errors. Various theoretical results are established, which include showing that our family of operators preserve optimality and increase the action gap in a stochastic sense. Our empirical results illustrate the strong benefits of our robust stochastic operators, significantly outperforming the classical Bellman operator and recently proposed operators.", "full_text": "A Family of Robust Stochastic Operators for Reinforcement Learning\n\nYingdong Lu, Mark S. Squillante, Chai Wah Wu\n\nMathematical Sciences, IBM Research\n\nYorktown Heights, NY 10598, USA\n\n{yingdong, mss, cwwu}@us.ibm.com\n\nAbstract\n\nWe consider a new family of stochastic operators for reinforcement learning that seeks to alleviate negative effects and become more robust to approximation or estimation errors. Theoretical results are established, showing that our family of operators preserve optimality and increase the action gap in a stochastic sense. Empirical results illustrate the strong benefits of our robust stochastic operators, significantly outperforming the classical Bellman and recently proposed operators.\n\n1 Introduction\n\nReinforcement learning has a rich history within the machine learning community as a means of solving a wide variety of decision-making problems in environments with unknown and possibly unstructured dynamics. Through iterative application of a convergent operator, value-based reinforcement learning (RL) generates successive refinements of an initial value function [14, 22, 21]. 
Q-learning [24] is a particular RL technique in which the computations of value iteration consist of evaluating the corresponding Bellman equation without a model of the environment.\n\nWhile Q-learning continues to be broadly and successfully used in RL to determine the optimal actions of an agent, the development of new Q-learning approaches that improve convergence speed, accuracy and robustness remains of great interest. One area of particular interest concerns environments in which there exist approximation or estimation errors. Of course, when no approximation/estimation errors are present, the corresponding Markov decision process (MDP) can be solved exactly with the Bellman operator. However, in the presence of nonstationary errors (a typically encountered example being when a discrete-state, discrete-time MDP is used to approximate a continuous-state, continuous-time system), the optimal state-action value function obtained through the Bellman operator does not always describe the value of stationary policies. Hence, when the optimal state-action value function and the suboptimal state-action value functions are reasonably close to each other, approximation/estimation errors can cause suboptimal actions to be chosen instead of an optimal action, which in turn can cause errors in identifying truly optimal actions.\n\nTo help explain and formalize this phenomenon, Farahmand [13] introduced the notion of action-gap regularity and showed that a larger action-gap regularity implies a smaller loss in performance. Building on action-gap regularity and its benefits with respect to (w.r.t.) performance loss, Bellemare et al. 
[6] considered a particular approach to having the value iteration converge to an alternative action-value function Q associated with the same optimal action policy (i.e., maintaining the optimality-preserving property) while at the same time achieving a larger separation between the Q-values of optimal actions and those of suboptimal actions (i.e., maintaining the gap-increasing property). The former property ensures optimality, whereas the latter may assist the value-iteration algorithm in determining the optimal actions of an agent faster, more easily, and with fewer errors of mislabeling suboptimal actions. Therefore, by exploiting weaker optimality conditions than the Bellman equation, and due to the known benefits of larger action-gap regularity, this approach can potentially lead to alternatives to the classical Bellman operator that improve the convergence speed, accuracy and robustness of RL in environments with approximation/estimation errors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFollowing this approach, Bellemare et al. [6] propose purely deterministic operator alternatives to the classical Bellman operator and show that the proposed operators satisfy the properties of optimality-preserving and gap-increasing. Then, after empirically demonstrating the benefits of their proposed deterministic operator alternatives, the authors raise a number of open fundamental questions w.r.t. the possibility of weaker conditions for optimality, the statistical efficiency of their proposed operators, and the possibility of finding a maximally efficient operator.\n\nAt the heart of the problem is a fundamental tradeoff between the degree to which the preservation of optimality is violated and the degree to which the action gap is increased. 
Although the benefits of increasing action-gap regularity are known [13], increasing the action gap beyond a certain region in a deterministic sense can lead to violations of optimality preservation (due to deviating too far from Bellman optimality), thus rendering value iterations that may not converge to optimal solutions. Hence, any purely deterministic operator alternative is unfortunately limited in the degree to which it can be both gap-increasing and optimality-preserving, and thus in turn limited in the degree to which it can address the above problems of approximation/estimation errors in RL.\n\nWe therefore consider an approach based on a novel stochastic framework that can increase the action gap well beyond such a deterministic region for individual value iterations, via a random variable (r.v.), while controlling the overall value iterations in a probabilistic manner, via a sequence of r.v.s, so as to ensure optimality preservation in a stochastic sense. Our general approach is applicable to arbitrary Q-value approximation schemes in which the sequence of r.v.s provides support to devalue suboptimal actions while preserving the set of optimal policies almost surely (a.s.), thus making it possible to increase the action gap between the Q-values of optimal and suboptimal actions beyond the deterministic region; this can be important in practice because of the potential advantages of increasing the action gap when there are approximation/estimation errors. In devising a family of operators within our framework endowed with these properties, we provide a general stochastic approach that can address the inherent deficiencies of purely deterministic operator alternatives to the classical Bellman operator and that can potentially yield greater robustness w.r.t. mislabeling suboptimal actions in the presence of approximation/estimation errors. 
To the best of our knowledge, this paper presents the first proposal and theoretical analysis of such types of robust stochastic operators (RSOs), an approach not often seen in the RL literature and one that, we believe, should be exploited to a much greater extent.\n\nThe research literature contains a wide variety of studies of operator alternatives to the Bellman operator, including the ε-greedy method [24], speedy Q-learning [3], policy iteration-like Q-learning [8], and the Boltzmann softmax operator and its variants [2]. Each of these operator alternatives seeks to address certain issues in RL. In this paper we complement these previous studies of operator alternatives and focus on operators that seek to achieve greater robustness w.r.t. approximation/estimation errors; in fact, our empirical studies are based on Q-learning with the ε-greedy method.\n\nOur theoretical results include proving that our stochastic operators are optimality-preserving and gap-increasing in a stochastic sense. Since the value-iteration sequence generated under our stochastic operators is based on realizations of independent nonnegative r.v.s, our family of RSOs subsumes the family of purely deterministic operators in [6] as a strict subset (because the realizations of the r.v.s can be fixed to match those of any deterministic operator as a special case). We further prove that stochastic and variability orderings among the sequence of random operators lead to corresponding orderings among the action gaps. Our stochastic framework and theoretical results shed new light on the open fundamental questions raised in [6], including our family of RSOs rendering significantly weaker conditions for optimality and significantly stronger statistical efficiency. 
Another important implication of our results is that the search space for the maximally efficient operator should be an infinite dimensional space of sequences of r.v.s, instead of the finite dimensional space alluded to in [6]. Yet another important implication is that the order relationships among the sequences of random operators w.r.t. action gaps, corresponding to our stochastic and variability ordering results, may potentially lead to determining the best sequence of r.v.s and possibly even to maximally efficient operators. These theoretical results further extend our understanding of the relationship between action-gap regularity and the effectiveness of Q-learning in environments with approximation/estimation errors beyond the initial studies in [13, 6].\n\nWe subsequently apply our RSOs to obtain empirical results for various problems in the OpenAI Gym framework [10]. Using the existing codes with minor modifications, we compare the empirical results under our family of stochastic operators against those under both the classical Bellman operator and the consistent Bellman operator [6]. These experiments consistently show that our RSOs outperform both of these deterministic operators. Appendix C of the supplement provides the corresponding python code modifications used in our experiments.\n\n2 Preliminaries\n\nWe consider a standard RL framework (see, e.g., [7]) in which a learning agent interacts with a stochastic environment. This interaction is modeled as a discrete-space, discrete-time discounted MDP denoted by (X, A, P, R, γ), where X represents the set of states, A the set of actions, P the transition probability kernel, R the reward function mapping state-action pairs into a bounded subset of R, and γ ∈ [0, 1) the discount factor. Let Q denote the set of bounded real-valued functions over X × A. 
For Q ∈ Q, define V(x) := max_a Q(x, a), and use the same definition for variants such as Q̂ ∈ Q and V̂(x). Let x′ always denote the next-state r.v. For the current state x in which action a is taken, i.e., (x, a) ∈ X × A, denote by P(·|x, a) the conditional transition probability for the next state x′, and define EP := E_{x′∼P(·|x,a)} to be the expectation w.r.t. P(·|x, a).\n\nA stationary policy π(·|x) : X → A defines the distribution of control actions given the current state x, which reduces to a deterministic policy when the conditional distribution renders a constant action for each state x; with slight abuse of notation, we always write the policy π(x). The stationary policy π induces a value function Vπ : X → R and an action-value function Qπ : X × A → R, where Vπ(x) := Qπ(x, π(x)) defines the expected discounted cumulative reward under policy π starting in state x and where Qπ satisfies the Bellman equation\n\nQπ(x, a) = R(x, a) + γ EP Qπ(x′, π(x′)).   (1)\n\nOur goal is to determine a policy π* that achieves the optimal value function V*(x) := sup_π Vπ(x), ∀x ∈ X, which also produces the optimal action-value function Q*(x, a) := sup_π Qπ(x, a), ∀(x, a) ∈ X × A. The Bellman operator TB : Q → Q is defined pointwise as\n\nTBQ(x, a) := R(x, a) + γ EP max_{b∈A} Q(x′, b),   (2)\n\nor equivalently TBQ(x, a) = R(x, a) + γ EP V(x′). The Bellman operator TB is known (see, e.g., [7]) to be a contraction mapping in supremum norm, and its unique fixed point coincides with the optimal action-value function, namely Q*(x, a) = R(x, a) + γ EP max_{b∈A} Q*(x′, b), or equivalently Q*(x, a) = R(x, a) + γ EP V*(x′). 
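As a concrete illustration, the Bellman operator and its fixed-point iteration can be sketched in NumPy on a small tabular MDP (a minimal sketch with randomly generated R and P; the shapes, seed and parameter values are our own illustrative assumptions, not the paper's code):

```python
import numpy as np

def bellman_operator(Q, R, P, gamma):
    """One application of the Bellman operator T_B from Eq. (2).

    Q: (S, A) array of action values, R: (S, A) rewards,
    P: (S, A, S) transition probabilities P(x'|x, a).
    """
    V = Q.max(axis=1)            # V(x') = max_b Q(x', b)
    return R + gamma * P @ V     # R(x, a) + gamma * E_P V(x')

# Repeated application contracts in sup norm to the fixed point Q*.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
R = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # each row a valid distribution
Q = np.zeros((S, A))
for _ in range(500):
    Q = bellman_operator(Q, R, P, gamma)
```

Since the sup-norm error shrinks by a factor γ per sweep, 500 sweeps leave Q numerically indistinguishable from its fixed point.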
This in turn indicates that the optimal policy π* can be obtained by π*(x) = arg max_{a∈A} Q*(x, a), ∀x ∈ X.\n\nWhile the Bellman operator can exactly solve the MDP when there are no approximation/estimation errors, the previously noted differences between optimal and suboptimal state-action value functions in the presence of such errors can result in incorrectly identifying the optimal actions. To address these and related nonstationary effects of approximation/estimation errors arising in practice, Bellemare et al. [6] propose the so-called consistent Bellman operator defined as\n\nTCQ(x, a) := R(x, a) + γ EP [1{x≠x′} max_{b∈A} Q(x′, b) + 1{x=x′} Q(x, a)],   (3)\n\nwhere 1{·} denotes the indicator function. The consistent Bellman operator TC preserves a local form of stationarity by redefining the action-value function Q such that, if an action a ∈ A is taken from the state x ∈ X and the next state x′ = x, then action a is taken again. Bellemare et al. [6] proceed to show that the consistent Bellman operator yields the optimal policy π*, and in particular that TC is both optimality-preserving and gap-increasing, according to (deterministic) definitions that they provide which are compatible with those from Farahmand [13].\n\nThe proofs of our theoretical results involve mathematical arguments and technical details that are unique to stochastic operators and stochastic orderings, and distinct from any previous deterministic operators. In particular, a r.v. X is stochastically greater than or equal to a r.v. Y (written X ≥st Y) if P[X > z] ≥ P[Y > z] for all z, and a r.v. X is greater than or equal to a r.v. Y under a convex ordering (written X ≥cx Y) if and only if E[f(X)] ≥ E[f(Y)] for all convex functions f. 
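These two orderings can be checked numerically on simple examples (a Monte Carlo sketch with distributions of our own choosing; the two uniforms happen to match βk instances studied later in Section 4):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Stochastic order (>=st): X = Y + 1/2 with Y ~ U[0, 1) satisfies
# P[X > z] >= P[Y > z] for every threshold z.
Y = rng.uniform(0.0, 1.0, n)
X = Y + 0.5

def tail(Z, z):
    return (Z > z).mean()   # empirical P[Z > z]

st_ok = all(tail(X, z) >= tail(Y, z) for z in np.linspace(-0.5, 2.0, 26))

# Convex order (>=cx): U[0, 2) and U[0.5, 1.5) share mean 1, but the wider
# uniform is more variable, so E[f(A)] >= E[f(B)] for convex test functions f.
A = rng.uniform(0.0, 2.0, n)
B = rng.uniform(0.5, 1.5, n)
convex_tests = [np.square, lambda z: np.maximum(z - 1.0, 0.0), np.exp]
cx_gaps = [f(A).mean() - f(B).mean() for f in convex_tests]
```

For the squared test function the gap is just the variance difference, 1/3 - 1/12 = 1/4, which the sample estimate recovers.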
Additional technical details on these and other probabilistic terms and results underlying our theoretical results can be found in [9, 11, 18].\n\n3 Robust Stochastic Operators\n\nIn this section we present our stochastic framework, which includes proposing a general family of RSOs, providing precise definitions of the concepts of optimality-preserving and gap-increasing in a stochastic sense for a sequence of random operators, and establishing that any sequence from this general family of operators is optimality-preserving and gap-increasing. Our introduction of a new family of operators, and our shifting of the focus from one deterministic operator to a sequence of stochastic operators, has significant implications w.r.t. the open questions raised in [6]. Specifically, our results show that the conditions for optimality are much weaker and the statistical efficiency of our operators can be made much stronger, both allowing significant degrees of freedom in finding alternatives to the Bellman operator for different purposes and applications. Meanwhile, these important improvements completely alter and clarify the question of finding the maximally efficient operators, from the finite dimensional parameter optimization problem suggested in [6] to an optimization problem in an infinite dimensional space (of the infinite sequences of r.v.s), for which we establish that stochastic and variability orderings among the sequence of random operators lead to corresponding orderings among the action gaps. 
It is important to note that our approach can be extended to variants of the Bellman operator such as SARSA [17], policy evaluation [19] and fitted Q-iteration [12].\n\nFor all Q0 ∈ Q, x ∈ X, a ∈ A and sequences {βk : k ∈ Z+} of independent nonnegative r.v.s with expectation β̄k := Eβ[βk] between 0 and 1 inclusively for each k ∈ Z+, we define\n\nTβk Qk(x, a) := R(x, a) + γ EP max_{b∈A} Qk(x′, b) − βk (Vk(x) − Qk(x, a)),   (4)\n\nor equivalently Tβk Qk(x, a) := R(x, a) + γ EP V(x′) − βk (Vk(x) − Qk(x, a)). (Note that the operator in (4) is equivalent to the Bellman operator whenever the action a is optimal or βk = 0, thus making the difference term zero in these cases.) Then members of the general family of RSOs include the sequence {Tβk} defined over all probability distributions for each r.v. in the sequence {βk} with β̄k ∈ [0, 1]. (Note, in particular, that the r.v.s βk can follow a different probability distribution for each k.) We further define T^F_β to be the general family of RSOs comprising all sequences of operators {T}, each as defined in (4), such that there exists a sequence {βk} and, for all x ∈ X and a ∈ A, the following inequalities hold:\n\nTBQ(x, a) − βk (Vk(x) − Qk(x, a)) ≤ T Q(x, a) ≤ TBQ(x, a).\n\nObserve that, for any (x, a) in (4) where a is not the optimal action, we have Vk(x) > Qk(x, a) occurring very often (i.e., for many k), causing Q(x, a) to (eventually) deviate more from V(x); otherwise, for a such that Q(x, a) = V(x), Vk(x) > Qk(x, a) will only happen relatively rarely, thus not affecting the end value of V(x). 
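One sweep of the operator in (4), together with a randomized value iteration using βk ∼ U[0, 2) (one of the instances studied later in Section 4), can be sketched as follows (an illustrative tabular sketch on our own randomly generated MDP, not the paper's code):

```python
import numpy as np

def rso_operator(Q, R, P, gamma, beta_k):
    """One application of T_{beta_k} from Eq. (4) for a realization beta_k.

    The penalty term beta_k * (V_k(x) - Q_k(x, a)) vanishes at maximizing
    actions, so those entries coincide with the plain Bellman update.
    """
    V = Q.max(axis=1)
    bellman = R + gamma * P @ V            # T_B Q
    return bellman - beta_k * (V[:, None] - Q)

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
R = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))
Q = np.zeros((S, A))
for _ in range(2000):
    Q = rso_operator(Q, R, P, gamma, rng.uniform(0.0, 2.0))  # beta_k ~ U[0, 2)
```

In runs of this sketch, the per-state maxima track the optimal values while the suboptimal entries are pushed further down, i.e., the action gaps widen.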
Since the value function V(x) does not change but the action-value function Q(x, a) does indeed change, this can lead to a larger action gap and can potentially render more efficient ways of ultimately finding V(x) through the iterative updating of Q(x, a), as indicated in [13, 6]. Moreover, we observe that the multiplier βk in front of Vk(x) − Qk(x, a) is desired to be relatively large individually, but its overall effects should not be so large as to affect the end value of V(x). We therefore introduce a family of RSOs where βk is allowed to take on any value as long as its average β̄k remains less than or equal to 1. Obviously, these conditions are strictly weaker than those identified in [6], theirs being purely deterministic and constrained to [0, 1), and ours based on r.v.s βk that can take on values well outside of [0, 1). Since the r.v.s βk need not be identically distributed (with the sole requirement that β̄k is between 0 and 1 inclusively) and since realizations of βk can take on values far beyond or equal to 1, the family of operators T^F_β clearly subsumes the family of previously identified deterministic operators as a special case.\n\nFor the analysis of our family of stochastic operators, we consider the following key definitions.\n\nDefinition 3.1. A sequence of random operators {Tk} for M = (X, A, P, R, γ) is optimality-preserving in a stochastic sense if for any Q0 ∈ Q and x ∈ X, and for the sequence of r.v.s Qk+1 := TkQk, the following properties hold: Vk(x) := max_{a∈A} Qk(x, a) converges a.s. to a constant V̂(x) as k → ∞, and for all a ∈ A, we have a.s.\n\nQ*(x, a) < V*(x) ⇒ lim sup_{k→∞} Qk(x, a) < V̂(x).   (5)\n\nDefinition 3.2. A sequence of random operators {Tk} for M = (X, A, P, R, γ) is gap-increasing in a stochastic sense if for all Q0 ∈ Q, x ∈ X and a ∈ A, the following inequality holds a.s.:\n\nA(x, a) := lim inf_{k→∞} [Vk(x) − Qk(x, a)] ≥ V*(x) − Q*(x, a).   (6)\n\nThe property of the optimality-preserving definition essentially ensures a.s. that at least one optimal action remains optimal and all suboptimal actions remain suboptimal, while the property of the gap-increasing definition implies robustness when the inequality (6) is strict a.s. for at least one (x, a) ∈ X × A. In particular, as the action gap of an operator increases while remaining optimality-preserving, the end result can be greater robustness to approximation/estimation errors [13].\n\nWe next present one of our main theoretical results, establishing that our general family of RSOs is both optimality-preserving and gap-increasing in a stochastic sense.\n\nTheorem 3.1. Let TB be the Bellman operator defined in (2) and {Tβk} a sequence of RSOs as defined in (4). Considering the sequence of r.v.s Qk+1 := Tβk Qk on a sample path basis with Q0 ∈ Q, the sequence of operators {Tβk} is both optimality-preserving and gap-increasing in a stochastic sense, a.s. Furthermore, all operators in the family T^F_β are optimality-preserving and gap-increasing in a stochastic sense, a.s.\n\nEven though the stochastic operators in T^F_β are not contraction mappings and therefore do not have a fixed point (as is also true for TC [6]), Theorem 3.1 establishes that each of these stochastic operators in T^F_β is still optimality-preserving. Moreover, the definition of T^F_β and Theorem 3.1 significantly enlarge the set of optimality-preserving and gap-increasing operators beyond the purely deterministic operators identified in [6]. In particular, our new sufficient conditions for optimality-preserving operators in a stochastic sense imply that significant deviation from the Bellman operator is possible without loss of optimality; in comparison, the deterministic operator in [6] never allows a value of βk equal to or greater than 1. More importantly, the definition of T^F_β and Theorem 3.1 imply that the search space for maximally efficient operators is an infinite dimensional space of sequences of r.v.s, instead of the finite dimensional space for maximally efficient operators alluded to in [6]. To this end, and due to our stochastic framework, we now establish results on stochastic ordering properties among the sequences of r.v.s {βk} that lead to corresponding ordering properties among the action gaps of the random operators. These results offer key relational insights into important orderings of different operators in T^F_β, which further demonstrate the benefit of our RSOs and can potentially be exploited in searching for and attempting to find maximally efficient operators in practice.\n\nTheorem 3.2. For all Q̂0 = Q̃0 = Q0 ∈ Q and for each integer k ≥ 0, suppose Q̂k+1 and Q̃k+1 are respectively updated with two different RSOs Tβ̂k and Tβ̃k that are distinguished by β̂k and β̃k satisfying the stochastic ordering β̂k ≥st β̃k; namely Q̂k+1 = Tβ̂k Q̂k and Q̃k+1 = Tβ̃k Q̃k. Then the action gaps of the two systems are stochastically ordered in the same direction, namely Â(x, a) ≥st Ã(x, a).\n\nTheorem 3.3. For all Q̂0 = Q̃0 = Q0 ∈ Q and for each integer k ≥ 0, suppose Q̂k+1 and Q̃k+1 are respectively updated with two different RSOs Tβ̂k and Tβ̃k that are distinguished by β̂k and β̃k satisfying the convex ordering β̂k ≥cx β̃k; namely Q̂k+1 = Tβ̂k Q̂k and Q̃k+1 = Tβ̃k Q̃k. Then the action gaps of the two systems are convex ordered in the same direction, namely Â(x, a) ≥cx Ã(x, a).\n\nTheorem 3.4. For all Q̂0 = Q̃0 = Q0 ∈ Q and for each integer k ≥ 0, suppose Q̂k+1 and Q̃k+1 are respectively updated with two different RSOs Tβ̂k and Tβ̃k that are distinguished by β̂k and β̃k satisfying E[β̂k] = E[β̃k] and Var[β̂k] ≤ Var[β̃k]; namely Q̂k+1 = Tβ̂k Q̂k and Q̃k+1 = Tβ̃k Q̃k. Then we have Var[Q̂k+1] ≤ Var[Q̃k+1]. Furthermore, the action gaps of the two systems have the following properties: E[Â(x, a)] = E[Ã(x, a)] and Var[Â(x, a)] ≤ Var[Ã(x, a)].\n\nThe first two theorems conclude that, among the sequences of βk that preserve optimality, stochastically larger and more variable sequences can produce larger action gaps w.r.t. two standard and important stochastic orderings. Theorem 3.4 points out that a larger variance for βk, with the same fixed mean value, leads to a larger variance for Qk(x, a), while rendering the same expectation for the action gap and a larger variance in the action gap. We know that, in the limit, the optimal action will maintain its state-action value function. Then, when k is sufficiently large, we can expect that the state-action value function Qk(x, b*) for the optimal action b* in state x will be very close to the optimal value Q*(x, b*). A larger variance therefore suggests the potential for a greater separation between Qk(x, b*) and the state-action value function Qk(x, a) for suboptimal actions a, and thus the latter can be understood to have a larger action gap in the limit. Hence, sequences of βk with large variances can be seen as a very simple instance of the stochastic ordering results.\n\n4 Experimental Results\n\nWithin the general RL framework of interest, we consider a standard, yet generic, form of Q-learning so as to cover the various problems empirically examined in this section. 
Specifically, for all Q0 ∈ Q, x ∈ X, a ∈ A and an operator of interest T, we consider the sequence of action-value Q-functions based on the following generic update rule:\n\nQk+1(x, a) = (1 − αk) Qk(x, a) + αk T Qk(x, a),   (7)\n\nwhere αk is the learning rate for iteration k. Our theoretical results study the behavior of Q(x, a) under a general class of different operators, establishing the benefits of our RSOs over previously proposed operators. We now turn to our empirical comparisons, which consist of the Bellman operator TB, the consistent Bellman operator TC, and instances of our family of RSOs Tβk under different distributions for the sequence of βk.\n\nWe conduct various experiments across several well-known problems using the OpenAI Gym framework [10], namely Acrobot, Mountain Car, Cart Pole and Lunar Lander. This collection of problems spans a variety of RL examples with different characteristics, dimensions, parameters, and so on. In each case, the state space is continuous and discretized to a finite set of states; i.e., each dimension is discretized into equally spaced bins, where the number of bins depends on the problem to be solved and the reference codebase used. For every problem, the specific Q-learning algorithms considered are defined as in (7) where the appropriate operator of interest TB, TC or Tβk is substituted for T; at each timestep, (7) is iteratively applied to the Q-function at the current state and action. The experiments for each problem from the OpenAI Gym were run using the existing code found at [23, 1] exactly as is with the default parameter settings, the sole change being the replacement of the Bellman operator in the code with corresponding implementations of either the consistent Bellman operator or an RSO; see Appendix C of the supplement for the corresponding python code. 
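While the actual experiment code appears in Appendix C of the supplement, the update rule (7) with one-sample operator targets can be sketched as follows (the two-state chain, the ε-greedy details and all parameter values here are hypothetical illustrations, not the paper's benchmarks):

```python
import numpy as np

def bellman_target(Q, x, a, r, x_next, gamma):
    """One-sample Bellman target, Eq. (2), on a sampled transition."""
    return r + gamma * Q[x_next].max()

def consistent_target(Q, x, a, r, x_next, gamma):
    """One-sample consistent Bellman target, Eq. (3): reuse Q(x, a) if x' == x."""
    cont = Q[x, a] if x_next == x else Q[x_next].max()
    return r + gamma * cont

def rso_target(Q, x, a, r, x_next, gamma, beta):
    """One-sample RSO target, Eq. (4), for a realization beta of beta_k."""
    return r + gamma * Q[x_next].max() - beta * (Q[x].max() - Q[x, a])

def q_learning_step(Q, x, a, target, alpha):
    """Update rule (7): Q(x, a) <- (1 - alpha) Q(x, a) + alpha * T Q(x, a)."""
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    return Q

# Hypothetical 2-state chain: action 1 always moves to the rewarding state 1.
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
gamma, alpha, eps = 0.9, 0.1, 0.1
x = 0
for _ in range(5000):
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[x].argmax())
    x_next = 1 if a == 1 else 0
    r = 1.0 if x_next == 1 else 0.0
    t = rso_target(Q, x, a, r, x_next, gamma, rng.uniform(0.0, 2.0))
    Q = q_learning_step(Q, x, a, t, alpha)
    x = x_next
```

Swapping `rso_target` for `bellman_target` or `consistent_target` changes only the target computation, mirroring how the operators are interchanged in the referenced codebases.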
It is apparent from these codes that an RSO can be directly and easily implemented as a replacement for the classical Bellman operator.\n\nWe note that each of the algorithms from the OpenAI Gym implements a form of the ε-greedy method (e.g., occasionally picking a random action or using a randomly perturbed Q-function for determining the action) to enable some form of exploration in addition to the exploitation-based search of the optimal policy using the Q-function. Our experiments were therefore repeated over a wide range of values for ε, where we found that the relative performance trends of the various operators did not depend significantly on the amount of exploration under the ε-greedy algorithm. In particular, the same performance trends were observed over a wide range of ε values, and hence we present results based on the default value of ε used in the reference codebase.\n\nMultiple experimental trials are run for each problem, where we ensured the setting of the random starting state to be the same in each experimental trial for all of the operators considered by initializing them with the same random seed. We observe in general across all experimental results that, for different problems and different variants of the Q-learning algorithm, simply replacing the Bellman operator or the consistent Bellman operator with an RSO results in significant performance improvements. The RSOs considered in every set of experimental trials for each problem consist of different distributions for the sequence of βk. Specifically, we empirically study the following instances of our family of RSOs:\n\n1. βk sampled from a uniform distribution over [0, 1), thus E[βk] = 1/2 and Var[βk] = 1/12;\n2. βk sampled from a uniform distribution over [0, 2), thus E[βk] = 1 and Var[βk] = 1/3;\n3. 
βk sampled from a uniform distribution over [0.5, 1.5), thus E[βk] = 1 and Var[βk] = 1/12;\n4. βk set to 3/5 plus a r.v. sampled from a Beta(2, 3) distribution, thus E[βk] = 1 and Var[βk] = 1/25;\n5. βk set to 7/9 plus a r.v. sampled from a Beta(2, 7) distribution, thus E[βk] = 1 and Var[βk] = 7/405;\n6. βk set to a r.v. sampled from a Pareto(1, 2) distribution minus 1, thus E[βk] = 1 and Var[βk] = ∞;\n7. βk set to a r.v. sampled from a Pareto(1, 3) distribution minus 1/2, thus E[βk] = 1 and Var[βk] = 3/4;\n8. βk set to 0.5 and 1.5 in an alternating manner, thus having E[βk] = 1 and Var[βk] = 1/4;\n9. βk set to 1, thus having E[βk] = 1 and Var[βk] = 0.\n\nObserve that the first and second RSO instances include values of βk that are equal or relatively close to 0; since xm = 1 in the sixth instance, together with the subtraction of 1, this instance also includes values of βk that are equal or relatively close to 0; all other RSO instances exclude values of βk that are equal or relatively close to 0. We note that the last RSO instance is consistent with the advantage learning operator in [4, 6], though it is important to note that β = 1 was disallowed in [6], unnecessarily so as our results have established. To gain insight on the different RSO instances, the results presented in this section focus on the simple case of operators Tβk associated with sequences of r.v.s {βk} drawn from specific probability distributions in an independent and identically distributed manner. We note, however, that various experiments were performed with very simple combinations of different distributions for βk over the iterations k ∈ Z+. As a specific example, we considered βk ∼ U[0, 1) for β0, . . . , βk′ and then βk ∼ U[0, 2) for βk′+1, . .
., but these results were not considerably better, and often worse, than those presented below for βk ∼ U[0, 2).

(a) Acrobot problem (training).

(b) Mountain Car problem (training).

Figure 1: Average number of steps needed to solve minimization problems during training phase.

4.1 Acrobot

This problem is first discussed in [20]. The state vector is 6-dimensional with three actions possible in each state, and the score represents the number of timesteps needed to solve the problem. The position and velocity are discretized into 8 bins whereas the other state components are discretized into 10 bins. We ran 50 experimental trials over many episodes, with a goal of minimizing the score.
Figure 1a plots the score, averaged over moving windows of 1000 episodes across the 50 trials, as a function of the number of episodes for a subset of operators during the training phase; the full set of results is provided in Figure 3. We observe that the average scores under the RSOs generally exhibit much better performance than under the Bellman operator or the consistent Bellman operator, with the βk sequences of all ones and from Beta(2, 7) rendering the best performance. Table 1 presents the average score over the last 1000 episodes across the 50 trials together with the corresponding 95% confidence intervals. We observe that the confidence intervals for all operators are quite small and that the best average scores are consistent with those plotted in Figure 3.
Figure 2b presents the average score over 1000 episodes across the 50 trials for all operators during the testing phase, together with the corresponding 95% confidence intervals. We again observe that the best average scores are obtained under many of the RSOs and that the confidence intervals for all operators are quite small.
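Concretely, replacing the Bellman operator with an RSO in the tabular Q-learning used for these problems amounts to shifting the usual Bellman target by the offset βk(V(x) − Q(x, a)) discussed in Section 5. The following is a minimal sketch, not the exact code used in our experiments; the environment glue, state encoding, and hyperparameter values are illustrative only:

```python
import random
from collections import defaultdict

def rso_q_update(Q, s, a, r, s_next, alpha, gamma, beta):
    """One tabular Q-learning step with the Bellman target shifted by
    beta * (V(s) - Q(s, a)); beta = 0 recovers the classical Bellman update."""
    v_s = max(Q[s])          # V(s) = max_a Q(s, a)
    v_next = max(Q[s_next])  # V(s') used in the Bellman target
    target = r + gamma * v_next - beta * (v_s - Q[s][a])
    Q[s][a] += alpha * (target - Q[s][a])

# Illustrative usage with 3 actions per state and beta_k ~ U[0, 2):
Q = defaultdict(lambda: [0.0, 0.0, 0.0])
beta_k = random.uniform(0.0, 2.0)
rso_q_update(Q, s=(3, 5), a=1, r=-1.0, s_next=(3, 6),
             alpha=0.1, gamma=0.99, beta=beta_k)
```

Drawing a fresh beta on each update step corresponds to the i.i.d. sequences {βk} studied above; holding it at 0 or 1 recovers the deterministic operators.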
We further observe the differences in the performance orderings among the operators in comparison with the results in Table 1, where the βk sequences from Pareto(1, 2) and alternating 0.5 and 1.5 render the best performance, followed by βk sequences from U[0, 1).

4.2 Mountain Car

This problem is first discussed in [16]. The state vector is 2-dimensional with a total of three possible actions, and the score represents the number of timesteps needed to solve the problem. The state space is discretized into a 40 × 40 grid. We ran 50 experimental trials over many episodes for training, each of which consists of up to 200 steps and with a goal of minimizing the score.
Figure 1b plots the score, averaged over moving windows of 1000 episodes across the 50 trials, as a function of the number of episodes for a subset of operators during the training phase; the full set of results is provided in Figure 4. We observe that the average scores under the RSOs generally exhibit considerably better performance than under the Bellman operator or the consistent Bellman operator, with the βk sequences from Pareto(1, 2) and alternating 0.5 and 1.5 rendering the best performance, followed by βk sequences from U[0, 2). Table 1 presents the average score over the last 1000 episodes across the 50 trials together with the corresponding 95% confidence intervals.
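State discretizations like the 40 × 40 Mountain Car grid above map each continuous observation to a tuple of bin indices, which then serves as the key of the tabular Q-function. A minimal sketch follows; the helper name is ours, and the bounds shown are the standard Gym Mountain Car observation bounds:

```python
def discretize(obs, lows, highs, bins):
    """Map a continuous observation to a tuple of bin indices,
    clipping values that fall outside [lo, hi)."""
    idx = []
    for x, lo, hi, n in zip(obs, lows, highs, bins):
        t = (x - lo) / (hi - lo)          # scale into [0, 1)
        t = min(max(t, 0.0), 1.0 - 1e-9)  # clip to stay inside the grid
        idx.append(int(t * n))
    return tuple(idx)

# Mountain Car: position in [-1.2, 0.6], velocity in [-0.07, 0.07],
# discretized into a 40 x 40 grid as in the experiments above.
state = discretize((-0.5, 0.0), (-1.2, -0.07), (0.6, 0.07), (40, 40))
print(state)  # → (15, 20)
```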
We\nobserve that the con\ufb01dence intervals for all operators are quite small and that the best average scores\nare consistent with those plotted in Figure 4.\nFigure 2b presents the average score over 1000 episodes across the 50 trials for all operators during\nthe testing phase, together with the corresponding 95% con\ufb01dence intervals. We again observe that\nthe best average scores are generally obtained under the RSOs and that the con\ufb01dence intervals for\nall operators are quite small. We further observe the differences in the average score performance\norderings among the operators in comparison with the results in Table 1, where the \u03b2k sequences\nfrom Pareto(1, 3) and U [0, 2) render the best average score performance.\n\n4.3 Cart Pole\n\nThis problem is \ufb01rst discussed in [5]. The state vector is 4-dimensional with two actions possible in\neach state, and the score represents the number of steps where the cart pole stays upright before either\nfalling over or going out of bounds. The position and velocity are discretized into 8 bins whereas\nthe angle and angular velocity are discretized into 10 bins. We ran 50 experimental trials over many\nepisodes, each of which consists of up to 200 steps with a goal of maximizing the score. The problem\nis considered solved when the score exceeds 195.\nTable 1 presents the average score over the last 1000 episodes across the 50 trials for all operators\nduring the training phase, together with the corresponding 95% con\ufb01dence intervals. We observe\nthat the best average scores are obtained under many of the RSOs, with the \u03b2k sequences of all ones\nand from Beta(2, 7) rendering the best performance followed by \u03b2k sequences from U [0.5, 1.5). 
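The βk instances compared throughout these tables can all be drawn with Python's standard library; a brief sketch (the dictionary keys are our shorthand) that also spot-checks the stated means:

```python
import random
from statistics import mean

# A few of the beta_k instances studied in this section (keys are our shorthand):
samplers = {
    "U[0,1)":          lambda: random.uniform(0.0, 1.0),        # E = 1/2
    "U[0,2)":          lambda: random.uniform(0.0, 2.0),        # E = 1
    "U[0.5,1.5)":      lambda: random.uniform(0.5, 1.5),        # E = 1
    "3/5+Beta(2,3)":   lambda: 0.6 + random.betavariate(2, 3),  # E = 1
    "Pareto(1,3)-1/2": lambda: random.paretovariate(3) - 0.5,   # E = 1
}

random.seed(0)
for name, draw in samplers.items():
    xs = [draw() for _ in range(100_000)]
    print(f"{name:16s} sample mean = {mean(xs):.3f}")
```

Note that `random.paretovariate(alpha)` samples a Pareto distribution with scale xm = 1, matching the Pareto(1, α) instances above.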
We further observe that the confidence intervals for all operators are quite small.
Table 2 presents the average score over 1000 episodes across the 50 trials for all operators during the testing phase, together with the corresponding 95% confidence intervals. We again observe that the best average scores are obtained under many of the RSOs and that the confidence intervals for all operators are quite small. We further observe the differences in the average score performance orderings among the operators in comparison with the results in Table 1, where the βk sequences from U[0.5, 1.5) and U[0, 2) render the best average score performance.

(a) Average Lunar Lander score (training).

(b) Table of average scores (testing).

Figure 2: Average number of steps needed to solve the Lunar Lander maximization problem during the training phase; average scores for all RSO instances and three problems during the testing phase.

4.4 Lunar Lander

This problem is discussed in [10]. The state vector is 8-dimensional with a total of four possible actions, and the dynamics of this problem are known to be notoriously more difficult than those of the foregoing problems. The six continuous state variables are each discretized into 4 bins. The score represents the cumulative reward, comprising positive points for successful degrees of landing and negative points for fuel usage and crashing.
We ran 50 experimental trials over many episodes, each of which consists of up to 200 steps with a goal of maximizing the score.

Figure 2b (average testing scores, with 95% confidence intervals):

                         Acrobot          Mountain Car     Lunar Lander
  Bellman                189.1 ± 0.17%    131.2 ± 0.23%    −231.0 ± 0.92%
  Consistent Bellman     185.3 ± 0.20%    127.2 ± 0.22%    −185.1 ± 0.98%
  βk ∼ U[0,2)            189.5 ± 0.16%    121.2 ± 0.21%    −164.4 ± 1.05%
  βk ∼ U[0,1)            184.9 ± 0.18%    126.9 ± 0.23%    −207.0 ± 0.94%
  βk = 1.0               189.2 ± 0.18%    121.9 ± 0.21%    −157.8 ± 1.10%
  βk ∈ {0.5, 1.5}        181.3 ± 0.23%    122.3 ± 0.20%    −174.0 ± 1.01%
  βk ∼ U[0.5,1.5)        192.4 ± 0.13%    122.8 ± 0.21%    −168.1 ± 1.08%
  βk ∼ Beta(2,3)         185.0 ± 0.20%    122.6 ± 0.21%    −163.5 ± 1.13%
  βk ∼ Beta(2,7)         186.2 ± 0.19%    122.3 ± 0.21%    −164.8 ± 1.06%
  βk ∼ Pareto(2)         180.7 ± 0.37%    125.0 ± 0.20%    −216.9 ± 0.94%
  βk ∼ Pareto(3)         186.6 ± 0.21%    121.1 ± 0.21%    −166.2 ± 1.04%

Figure 2a plots the score, averaged over moving windows of 1000 episodes across the 50 trials, as a function of the number of episodes for a subset of operators during the training phase; the full set of results is provided in Figure 5. We observe that the average scores under the RSOs generally exhibit better performance than under the Bellman operator or the consistent Bellman operator, with the βk sequences from Beta(2, 3) and of all ones rendering the best performance. Table 1 presents the average score over the last 1000 episodes across the 50 trials together with the corresponding 95% confidence intervals.
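The reporting used throughout, scores averaged over moving windows of 1000 episodes together with 95% confidence intervals, can be reproduced with two small helpers. This is a sketch under our own assumptions; in particular, the normal-approximation interval below is a standard choice, not necessarily the exact procedure used for the tables:

```python
from statistics import mean, stdev

def moving_average(scores, w=1000):
    """Trailing moving average of per-episode scores over a window of w."""
    out, s = [], 0.0
    for i, x in enumerate(scores):
        s += x
        if i >= w:
            s -= scores[i - w]  # drop the score that left the window
        out.append(s / min(i + 1, w))
    return out

def ci95_halfwidth(xs):
    """Half-width of a normal-approximation 95% confidence interval."""
    return 1.96 * stdev(xs) / len(xs) ** 0.5

print(moving_average([1, 2, 3, 4], w=2))  # → [1.0, 1.5, 2.5, 3.5]
```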
We observe that the con\ufb01dence intervals for all operators are quite small and\nthat the best average scores are consistent with those plotted in Figure 5.\nFigure 2b presents the average score over 1000 episodes across the 50 trials for all operators during\nthe testing phase, together with the corresponding 95% con\ufb01dence intervals. We again observe that\nthe best average scores are generally obtained under the RSOs and that the con\ufb01dence intervals for\nall operators are quite small. We further observe some consistencies in the performance orderings\namong the operators in comparison with the results in Table 1, where the \u03b2k sequences of all ones\nand from Beta(2, 3) render the best performance followed by \u03b2k sequences from U [0, 2).\n\n5 Conclusions and Discussion\n\nBuilding on the work of Farahmand [13] and Bellemare et al. [6], who argue that increasing the\naction gap while preserving optimality can improve the performance of value-iteration algorithms\nin environments with approximation or estimation errors, we propose and analyze a new general\nfamily of RSOs for RL that subsumes as a strict subset the classical Bellman operator and other\npurely deterministic operators proposed in the literature. Our theoretical results include proving that\nour stochastic operators are optimality-preserving and gap-increasing in a stochastic sense and that\nstochastic and variability orderings among the sequence of random operators lead to corresponding\norderings among the action gaps. In addition, our stochastic framework and theoretical results shed\nnew light on and help to resolve the open fundamental questions raised in [6] related to the possibility\nof weaker optimality conditions, the statistical ef\ufb01ciency of proposed deterministic operators, and the\npossibility of \ufb01nding maximally ef\ufb01cient operators. 
Specifically, our theoretical results show that the conditions for optimality are much weaker and that the statistical efficiency of our stochastic operators can be made much stronger, both of which allow significant degrees of freedom in finding alternatives to the Bellman operator for different purposes and applications. Meanwhile, these improvements recast the question of finding maximally efficient operators from the finite-dimensional parameter optimization problem suggested in [6] to an optimization problem over an infinite-dimensional space (of infinite sequences of r.v.s), for which our established stochastic and variability orderings among sequences of random operators can potentially assist in searching for maximally efficient operators in practice. Our family of RSOs represents a stochastic approach not often seen in the RL literature, one that should be exploited to a much greater extent.
A collection of empirical results – based on well-known problems within the OpenAI Gym framework spanning various RL examples with diverse characteristics – supports our theoretical results, consistently demonstrating and quantifying the significant performance improvements obtained with our RSOs over existing operators. We note that, while the focus of our empirical results has been on Q-learning, our family of RSOs is applicable to other RL approaches such as DQN [15].
It is important to highlight a few fundamental tradeoffs in identifying maximally efficient operators for different RL problems, based on our theoretical and empirical results. On the one hand, when sampled values of βk are relatively small, it is possible for the small offset βk(Vk(x) − Qk(x, a)) on truly suboptimal actions a to have limited or no effect on the separation between optimal and suboptimal actions.
On the other hand, when sampled values of βk are relatively large, it is possible for the large offset βk(Vk(x) − Qk(x, a)) to be applied against the truly optimal action a∗ due to approximation or estimation errors. In addition, the level of impact of these and related factors associated with the sequence of r.v.s {βk} can vary over the value iterations moving from k = 0 to the limit as k → ∞. We view the problem of finding maximally efficient operators for RL problems as one of identifying sequences of random operators that address these fundamental tradeoffs in order to maximize action-gap regularity for the suboptimal actions of each state. Our theoretical and empirical results further raise a related fundamental question: whether maximizing the action gap is sufficient to improve the performance of value-iteration algorithms in environments with approximation or estimation errors.

References

[1] M. Alzantot. Solution of mountaincar OpenAI Gym problem using Q-learning. https://gist.github.com/malzantot/9d1d3fa4fdc4a101bc48a135d8f9a289, 2017.

[2] K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning. In Proc. 34th International Conference on Machine Learning, 2017.

[3] M. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Advances in Neural Information Processing Systems, 24, 2011.

[4] L. Baird. Reinforcement Learning through Gradient Descent. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, U.S.A., 1999.

[5] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, Sept. 1983.

[6] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In Proc.
Thirtieth AAAI Conference on Arti\ufb01cial\nIntelligence, AAAI\u201916, pages 1476\u20131483. AAAI Press, 2016.\n\n[7] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scienti\ufb01c, 1996.\n\n[8] D. Bertsekas and H. Yu. Q-learning and enhanced policy iteration in discounted dynamic\n\nprogramming. Mathematics of Operations Research, 37, 2012.\n\n[9] P. Billingsley. Convergence of Probability Measures. Wiley, New York, Second edition, 1999.\n\n[10] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba.\n\nOpenAI Gym. CoRR, abs/1606.01540, 2016.\n\n[11] Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales.\n\nSpringer, 3rd edition, 2003.\n\n[12] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal\n\nof Machine Learning Research, 6:503\u2013556, 2005.\n\n[13] A. Farahmand. Action-gap phenomenon in reinforcement learning. Advances in Neural\n\nInformation Processing Systems, 24, 2011.\n\n[14] L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of\n\nArti\ufb01cial Intelligence Research, 4:237\u2013285, 1996.\n\n[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller.\nPlaying Atari with deep reinforcement learning. In Proc. NIPS Deep Learning Workshop, 2013.\n\n[16] A. Moore. Ef\ufb01cient Memory-Based Learning for Robot Control. PhD thesis, University of\n\nCambridge, Cambridge, U.K., 1990.\n\n[17] G. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical\n\nreport, Cambridge University, 1994.\n\n[18] M. Shaked and J. Shanthikumar. Stochastic Orders. Springer Series in Statistics. Springer New\n\nYork, 2007.\n\n[19] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning,\n\n3(1):9\u201344, 1988.\n\n[20] R. S. Sutton. 
Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems, 8:1038–1044, 1996.

[21] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2011.

[22] C. Szepesvari. Algorithms for reinforcement learning. In Synthesis Lectures on Artificial Intelligence and Machine Learning, volume 4.1, pages 1–103. Morgan & Claypool, 2010.

[23] V. M. Vilches. Basic reinforcement learning tutorial 4: Q-learning in OpenAI Gym. https://github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial4/README.md, May 2016.

[24] C. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, Cambridge, U.K., 1989.