{"title": "Action-Gap Phenomenon in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 172, "page_last": 180, "abstract": "Many practitioners of reinforcement learning have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy \\(\\hat{\\pi}\\) with respect to an action-value function \\(\\hat{Q}\\), the performance loss \\(E[V^*(X) - V^{\\hat{\\pi}}(X)]\\) is upper bounded by \\(O(\\|\\hat{Q} - Q^*\\|_\\infty^{1+\\zeta})\\), in which \\(\\zeta \\geq 0\\) is the parameter quantifying the action-gap regularity. For \\(\\zeta > 0\\), our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.", "full_text": "Action-Gap Phenomenon in Reinforcement Learning

Amir-massoud Farahmand*

School of Computer Science, McGill University
Montreal, Quebec, Canada

Abstract

Many practitioners of reinforcement learning have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy $\hat\pi$ with respect to an action-value function $\hat Q$, the performance loss $\mathbb{E}\big[V^*(X) - V^{\hat\pi}(X)\big]$ is upper bounded by $O(\|\hat Q - Q^*\|_\infty^{1+\zeta})$, in which $\zeta \geq 0$ is the parameter quantifying the action-gap regularity. For $\zeta > 0$, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.

1 Introduction

This paper introduces a new type of regularity in reinforcement learning (RL) and planning problems with finite action spaces which suggests that the convergence rate of the performance loss to zero is faster than what previous analyses had indicated. The effect of this regularity, which we call the action-gap regularity, is that oftentimes the performance of the RL agent reaches very close to the optimal performance (e.g., it always solves the mountain-car problem with the optimal number of steps) even though the estimated action-value function is still far from the optimal one.

Figure 1 illustrates the effect of this regularity in a simple problem. We use value iteration to solve a stochastic 1D chain walk problem (a slight modification of the example in Section 9.1 of [1]). The supremum of the difference between the estimate after $k$ iterations and the optimal action-value function behaves as $O(\gamma^k)$, in which $0 \leq \gamma < 1$ is the discount factor (notation is introduced in Section 2). Current theoretical results suggest that the convergence of the performance loss, which is defined as the average difference between the value of the optimal policy and the value of the greedy policy w.r.t. (with respect to) the estimated action-value function, should show the same $O(\gamma^k)$ behavior (cf. Proposition 6.1 of Bertsekas and Tsitsiklis [2]).
However, the behavior of the performance loss is often considerably faster, e.g., it is approximately $O(\gamma^{1.85k})$ in this example.

*www.SoloGen.net

Figure 1: Comparison of the action-value estimation error $\|\hat Q - Q^*\|_\infty$ and the performance loss $\|V^* - V^{\hat\pi}\|_1$ ($\hat\pi$ is the greedy policy with respect to $\hat Q$) at different iterations of the value iteration algorithm. The rate of decrease of the performance loss is considerably faster than that of the estimation error. The problem is a 1D stochastic chain walk with 500 states and $\gamma = 0.95$.

To gain a better understanding of the action-gap regularity, focus on a single state and suppose that there are only two actions available. When the estimated action-value function has a large error, the greedy policy w.r.t. it can possibly choose the suboptimal action. However, when the error becomes smaller than half of the gap between the value of the optimal action and that of the other one, the selected greedy action is the optimal action. After passing this threshold, the size of the error in the estimate of the action-value function at that state does not have any effect on the quality of the selected action. The larger the gap is, the more inaccurate the estimate can be while the selected greedy action is still the optimal one. On the other hand, if the estimated action-value function does not suggest a correct ordering of actions but the gap is negligibly small, the performance loss of not choosing the optimal action is small as well.
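The chain-walk experiment of Figure 1 can be reproduced in miniature. The following sketch is a hypothetical, much smaller setup (50 states, move-success probability 0.7, reward only for pushing into the rightmost state — not the paper's exact 500-state chain); it runs value iteration and records both the action-value error and the performance loss of the greedy policy:

```python
import numpy as np

def chain_walk_demo(n=50, gamma=0.95, iters=60):
    """Value iteration on a toy 1D chain; actions 0/1 try to move left/right,
    succeeding w.p. 0.7 (staying put otherwise); reward 1 for pushing right
    at the rightmost state."""
    P = np.zeros((2, n, n))  # P[a, s, s'] transition kernel
    for s in range(n):
        for a, step in ((0, -1), (1, 1)):
            s2 = min(max(s + step, 0), n - 1)
            P[a, s, s2] += 0.7
            P[a, s, s] += 0.3
    R = np.zeros((n, 2))
    R[-1, 1] = 1.0

    def bellman(Q):  # Bellman optimality operator T*
        return R + gamma * np.einsum('asn,n->sa', P, Q.max(axis=1))

    # A near-exact Q* via many Bellman iterations (geometric convergence).
    Q_star = np.zeros((n, 2))
    for _ in range(5000):
        Q_star = bellman(Q_star)
    V_star = Q_star.max(axis=1)

    def policy_value(pi):  # exact V^pi by solving (I - gamma P^pi) V = r^pi
        P_pi = P[pi, np.arange(n)]
        r_pi = R[np.arange(n), pi]
        return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

    Q = np.zeros((n, 2))
    q_errs, losses = [], []
    for _ in range(iters):
        Q = bellman(Q)
        q_errs.append(np.abs(Q - Q_star).max())  # ||Q_k - Q*||_inf
        losses.append(np.abs(V_star - policy_value(Q.argmax(axis=1))).mean())
    return q_errs, losses
```

In this toy chain the greedy policy becomes exactly optimal (zero loss) many iterations before $\|\hat Q_k - Q^*\|_\infty$ has decayed, which is the action-gap phenomenon in its simplest form.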
The presence of this gap in the optimal action-value function is what we call the action-gap regularity of the problem, and the described behavior is called the action-gap phenomenon.

The action-gap regularity is similar to the low-noise (or margin) condition in the classification literature. The low-noise condition is the assumption that the conditional probability of the class label given the input is "far" from the critical decision point. If this condition holds, "fast" convergence rates are obtainable, as was shown by Mammen and Tsybakov [3], Tsybakov [4], and Audibert and Tsybakov [5]. The low-noise condition is believed to be one reason that many high-dimensional classification problems can be solved with efficient sample complexity (cf. Rinaldo and Wasserman [6]). We borrow techniques developed in the classification literature, in particular by Audibert and Tsybakov [5], in our analysis.

It is notable that some works have used classification algorithms to solve reinforcement learning problems (e.g., Lagoudakis and Parr [7], Lazaric et al. [8]) or the related problem of apprenticeship learning (e.g., Syed and Schapire [9]). Nevertheless, the connection of this work to the classification literature lies only in borrowing theoretical ideas from that literature, not in using any particular algorithm. The focus of this work is indeed on value-based approaches, though one might expect that similar behavior can be observed in classification-based approaches as well.

In the rest of this paper, we formalize the action-gap phenomenon and prove that whenever the MDP has a favorable action-gap regularity, a fast convergence rate is achievable. Theorem 1 upper bounds the performance loss of the greedy policy w.r.t. the estimated action-value function by a function of the $L_p$-norm of the difference between the estimated action-value function and the optimal one.
Our result complements previous theoretical analyses of RL/Planning problems, such as those by Antos et al. [10], Munos and Szepesvári [11], Farahmand et al. [12, 13], and Maillard et al. [14], which mainly focused on the quality of the (action-)value function estimate and ignored the action-gap regularity. This synergy provides a clearer picture of what makes an RL/Planning problem easy or difficult. Finally, as an example of Theorem 1, we address the question of how the errors caused at each iteration of the Approximate Value Iteration (AVI) algorithm affect the quality of the outcome policy, and show that the AVI procedure benefits from the action-gap regularity of the problem (Theorem 2).

2 Notations

In this section, we provide a brief summary of some of the concepts and definitions from the theory of MDPs and RL. For more information, the reader is referred to Bertsekas and Tsitsiklis [2], Sutton and Barto [15], and Szepesvári [16].

For a space $\Omega$ with $\sigma$-algebra $\sigma_\Omega$, we define $\mathcal{M}(\Omega)$ as the set of all probability measures over $\sigma_\Omega$. $B(\Omega)$ denotes the space of bounded measurable functions w.r.t. (with respect to) $\sigma_\Omega$, and $B(\Omega, L)$ denotes the subset of $B(\Omega)$ with bound $0 < L < \infty$.

A finite-action discounted MDP is a 5-tuple $(\mathcal{X}, \mathcal{A}, P, \mathcal{R}, \gamma)$, where $\mathcal{X}$ is a measurable state space, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X})$ is the transition probability kernel, $\mathcal{R} : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathbb{R})$ is the reward distribution, and $0 \leq \gamma < 1$ is a discount factor. We denote $r(x,a) = \mathbb{E}[\mathcal{R}(\cdot|x,a)]$.

A measurable mapping $\pi : \mathcal{X} \to \mathcal{A}$ is called a deterministic Markov stationary policy, or just a policy in short.
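For a finite MDP, the objects just defined have a concrete array form. A minimal sketch (the 3-state example and all numbers are illustrative, not from the paper; $V^\pi$ anticipates the value function defined in the sequel):

```python
import numpy as np

# A 3-state, 2-action MDP (X, A, P, R, gamma): P[a, x, x'] is the transition
# kernel (each row a probability distribution over next states) and
# r[x, a] = E[R(.|x, a)] is the expected reward.
gamma = 0.9
P = np.array([[[1.0, 0.0, 0.0],
               [0.1, 0.9, 0.0],
               [0.0, 0.2, 0.8]],
              [[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0]]])
r = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])

# A deterministic Markov stationary policy is just a map X -> A:
pi = np.array([1, 1, 1])

# The induced kernel P^pi : X -> M(X) selects the row of the chosen action,
P_pi = P[pi, np.arange(3)]
assert np.allclose(P_pi.sum(axis=1), 1.0)
# and the value of pi solves the linear system V = r^pi + gamma * P^pi V.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r[np.arange(3), pi])
```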
An agent following a policy $\pi$ in an MDP means that at each time step $A_t = \pi(X_t)$.

A policy $\pi$ induces two transition probability kernels $P^\pi : \mathcal{X} \to \mathcal{M}(\mathcal{X})$ and $P^\pi : \mathcal{X}\times\mathcal{A} \to \mathcal{M}(\mathcal{X}\times\mathcal{A})$. For a measurable subset $A$ of $\mathcal{X}$ and a measurable subset $B$ of $\mathcal{X}\times\mathcal{A}$, we define $(P^\pi)(A|x) \triangleq \int P(dy|x,\pi(x))\, \mathbb{I}\{y \in A\}$ and $(P^\pi)(B|x,a) \triangleq \int P(dy|x,a)\, \mathbb{I}\{(y,\pi(y)) \in B\}$. The $m$-step transition probability kernels $(P^\pi)^m : \mathcal{X}\times\mathcal{A} \to \mathcal{M}(\mathcal{X}\times\mathcal{A})$ for $m = 2, 3, \cdots$ are inductively defined as $(P^\pi)^m(B|x,a) \triangleq \int_{\mathcal{X}} P(dy|x,a)\, (P^\pi)^{m-1}(B|y,\pi(y))$ (similarly for $(P^\pi)^m : \mathcal{X} \to \mathcal{M}(\mathcal{X})$).

Given a transition probability kernel $P : \mathcal{X} \to \mathcal{M}(\mathcal{X})$, define the right-linear operator $P\cdot : B(\mathcal{X}) \to B(\mathcal{X})$ by $(PV)(x) \triangleq \int_{\mathcal{X}} P(dy|x)\, V(y)$. For a probability measure $\rho \in \mathcal{M}(\mathcal{X})$ and a measurable subset $A$ of $\mathcal{X}$, define the left-linear operator $\cdot P : \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{X})$ by $(\rho P)(A) = \int \rho(dx) P(dy|x)\, \mathbb{I}\{y \in A\}$. A typical choice of $P$ is $(P^\pi)^m : \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{X})$. These operators for $P : \mathcal{X}\times\mathcal{A} \to \mathcal{M}(\mathcal{X}\times\mathcal{A})$ are defined similarly.

The value function $V^\pi$ and the action-value function $Q^\pi$ of a policy $\pi$ are defined as follows: let $(R_t;\, t \geq 1)$ be the sequence of rewards when the Markov chain is started from a state $X_1$ (state-action $(X_1, A_1)$ for the action-value function) drawn from a positive probability distribution over $\mathcal{X}$ ($\mathcal{X}\times\mathcal{A}$) and the agent follows the policy $\pi$. Then

$$V^\pi(x) \triangleq \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, X_1 = x \right] \quad \text{and} \quad Q^\pi(x,a) \triangleq \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, X_1 = x, A_1 = a \right].$$

For a discounted MDP, we define the optimal value and optimal action-value functions by $V^*(x) = \sup_\pi V^\pi(x)$ for all states $x \in \mathcal{X}$ and $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$ for all state-action pairs $(x,a) \in \mathcal{X}\times\mathcal{A}$. We say that a policy $\pi^*$ is optimal if it achieves the best values in every state, i.e., if $V^{\pi^*} = V^*$. We say that a policy $\pi$ is greedy w.r.t. an action-value function $Q$, and write $\pi = \hat\pi(\cdot; Q)$, if $\pi(x) = \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} Q(x,a)$ holds for all $x \in \mathcal{X}$ (if there exist multiple maximizers, a maximizer is chosen in an arbitrary deterministic manner). Greedy policies are important because a greedy policy w.r.t. the optimal action-value function $Q^*$ is an optimal policy.

For a fixed policy $\pi$, the Bellman operators $T^\pi : B(\mathcal{X}) \to B(\mathcal{X})$ and $T^\pi : B(\mathcal{X}\times\mathcal{A}) \to B(\mathcal{X}\times\mathcal{A})$ (for the action-value functions) are defined as $(T^\pi V)(x) \triangleq r(x,\pi(x)) + \gamma \int_{\mathcal{X}} V(y)\, P(dy|x,\pi(x))$ and $(T^\pi Q)(x,a) \triangleq r(x,a) + \gamma \int_{\mathcal{X}} Q(y,\pi(y))\, P(dy|x,a)$.
The fixed points of the Bellman operators are the (action-)value functions of the policy $\pi$, i.e., $T^\pi Q^\pi = Q^\pi$ and $T^\pi V^\pi = V^\pi$. Similarly, the Bellman optimality operators $T^* : B(\mathcal{X}) \to B(\mathcal{X})$ and $T^* : B(\mathcal{X}\times\mathcal{A}) \to B(\mathcal{X}\times\mathcal{A})$ (for the action-value functions) are defined as $(T^* V)(x) \triangleq \max_a \left\{ r(x,a) + \gamma \int_{\mathcal{X}} V(y)\, P(dy|x,a) \right\}$ and $(T^* Q)(x,a) \triangleq r(x,a) + \gamma \int_{\mathcal{X}} \max_{a'} Q(y,a')\, P(dy|x,a)$. Again, these operators enjoy a fixed-point property similar to that of the Bellman operators: $T^* Q^* = Q^*$ and $T^* V^* = V^*$.

For a probability measure $\rho \in \mathcal{M}(\mathcal{X})$ and a measurable function $V \in B(\mathcal{X})$, we define the $L_p(\rho)$-norm ($1 \leq p < \infty$) of $V$ as $\|V\|_{p,\rho} \triangleq \left[ \int_{\mathcal{X}} |V(x)|^p\, d\rho(x) \right]^{1/p}$. The $L_\infty(\mathcal{X})$-norm is defined as $\|V\|_\infty \triangleq \sup_{x\in\mathcal{X}} |V(x)|$. For $\rho \in \mathcal{M}(\mathcal{X}\times\mathcal{A})$ and $Q \in B(\mathcal{X}\times\mathcal{A})$, we define $\|Q\|_{p,\rho}$ ($1 \leq p < \infty$) by $\|Q\|_{p,\rho} \triangleq \left[ \frac{1}{|\mathcal{A}|} \sum_{a=1}^{|\mathcal{A}|} \|Q(\cdot,a)\|^p_{p,\rho} \right]^{1/p}$ and $\|Q\|_\infty \triangleq \sup_{(x,a)\in\mathcal{X}\times\mathcal{A}} |Q(x,a)|$.

3 Action-Gap Theorem

In this section, we present the action-gap theorem for an MDP $(\mathcal{X}, \mathcal{A}, P, \mathcal{R}, \gamma)$. To simplify the analysis, we assume that the number of actions $|\mathcal{A}|$ is only 2. We denote by $\rho^* \in \mathcal{M}(\mathcal{X})$ the stationary distribution induced by $\pi^*$, and we let $\rho \in \mathcal{M}(\mathcal{X})$ be a user-specified evaluation distribution. This distribution indicates the relative importance of regions of the state space to the user.

Figure 2: The action-gap function $g_{Q^*}(x)$ and the relative ordering of the optimal and the estimated action-value functions for a single state $x$. Depending on the ordering of the estimates, the greedy action is the same as (✓) or different from (✗) the optimal action. This figure does not show all possible configurations.

Suppose that an algorithm $\mathcal{A}$ receives a dataset $D_n = \{(X_1, A_1, R_1, X'_1), \dots, (X_n, A_n, R_n, X'_n)\}$ (with $R_t$ drawn from $\mathcal{R}(\cdot|X_t, A_t)$ and $X'_t$ drawn from $P(\cdot|X_t, A_t)$) and outputs $\hat Q$ as an estimate of the optimal action-value function, i.e., $\hat Q \leftarrow \mathcal{A}(D_n)$. The exact nature of this algorithm is not important, and it can be any online or offline, batch or incremental algorithm of choice, such as Q-learning, SARSA [15] and their variants [17], LSPI [1] or LARS-TD [18] in a policy iteration procedure, REG-LSPI [13], various Fitted Q-Iteration algorithms [19, 20, 12], or Linear Programming-based approaches [21, 22]. The only relevant aspect of $\hat Q$ is how well it approximates $Q^*$. We quantify the quality of the approximation by the $L_p$-norm $\|\hat Q - Q^*\|_{p,\rho^*}$ ($p \in [1,\infty]$).

The performance loss (or regret) of a policy $\pi$ is the expected difference between the value of the optimal policy $\pi^*$ and the value of $\pi$ when the initial state distribution is selected according to $\rho$, i.e.,

$$\mathrm{Loss}(\pi;\rho) \triangleq \int_{\mathcal{X}} \big( V^*(x) - V^\pi(x) \big)\, d\rho(x). \quad (1)$$

The value of $\mathrm{Loss}(\hat\pi;\rho)$, in which $\hat\pi$ is the greedy policy w.r.t. $\hat Q$, is the main quantity of interest and indicates how much worse, on average, the agent following policy $\hat\pi$ would perform compared to the optimal one.
The choice of $\rho$ enables the user to specify the relative importance of regions in the state space.

We define the action(-value)-gap function $g_{Q^*} : \mathcal{X} \to \mathbb{R}$ as

$$g_{Q^*}(x) \triangleq |Q^*(x,1) - Q^*(x,2)|.$$

This gap is shown in Figure 2. The following assumption quantifies the action-gap regularity.

Assumption A1 (Action-Gap). For a fixed MDP $(\mathcal{X}, \mathcal{A}, P, \mathcal{R}, \gamma)$ with $|\mathcal{A}| = 2$, there exist constants $c_g > 0$ and $\zeta \geq 0$ such that for all $t > 0$, we have

$$\mathbb{P}_{\rho^*}\big( 0 < g_{Q^*}(X) \leq t \big) \triangleq \int_{\mathcal{X}} \mathbb{I}\{0 < g_{Q^*}(x) \leq t\}\, d\rho^*(x) \leq c_g\, t^\zeta.$$

The value of $\zeta$ controls the distribution of the action-gap $g_{Q^*}(X)$. A large value of $\zeta$ indicates that the probability of $Q^*(X,1)$ being very close to $Q^*(X,2)$ is small, and vice versa. The smallness of this probability implies that the estimated action-value function $\hat Q$ might be rather inaccurate in a large subset of the state space (measured according to $\rho^*$) while its corresponding greedy policy is still the same as the optimal one. The case of $\zeta = 0$ and $c_g = 1$ is equivalent to not having any assumption on the action-gap. This assumption is inspired by the low-noise condition in the classification literature [5]. As an example of the typical behavior of an action-gap function, Figure 3 depicts $\mathbb{P}_{\rho^*}(0 < g_{Q^*}(X) \leq t)$ for the same 1D stochastic chain walk problem mentioned in the Introduction. It is seen that the probability of the action-gap function $g_{Q^*}$ being close to zero is very small.

Figure 3: The probability distribution $\mathbb{P}_{\rho^*}(0 < g_{Q^*}(X) \leq t)$ for a 1D stochastic chain walk with 500 states and $\gamma = 0.95$. Here the probability of the action-gap being close to zero is small.

Note that the specific polynomial form of the upper bound in Assumption A1 is only a modeling assumption that captures the essence of the action-gap regularity without being so general as to lead to unnecessarily complicated analyses.

As a result of the dynamical nature of MDPs, the performance loss depends not only on the choice of $\rho$ and $\rho^*$, but also on the transition probability kernel $P$. To analyze this dependence, we define a concentrability coefficient and use a change of measure argument similar to the work of Munos [23, 24] and Antos et al. [10].

Definition 1 (Concentrability of the Future-State Distribution). Given $\rho, \rho^* \in \mathcal{M}(\mathcal{X})$, a policy $\pi$, and an integer $m \geq 0$, let $\rho(P^\pi)^m \in \mathcal{M}(\mathcal{X})$ denote the future-state distribution obtained when the first state is distributed according to $\rho$ and we then follow the policy $\pi$ for $m$ steps. Denote the supremum of the Radon-Nikodym derivative of $\rho(P^\pi)^m$ w.r.t. $\rho^*$ by $c(m;\pi)$, i.e.,

$$c(m;\pi) \triangleq \left\| \frac{d\big(\rho(P^\pi)^m\big)}{d\rho^*} \right\|_\infty.$$

If $\rho(P^\pi)^m$ is not absolutely continuous w.r.t. $\rho^*$, we set $c(m;\pi) = \infty$. The concentrability of the future-state distribution coefficient is defined as

$$C(\rho,\rho^*) \triangleq \sup_\pi \sum_{m \geq 0} \gamma^m\, c(m;\pi).$$

The following theorem upper bounds the performance loss $\mathrm{Loss}(\hat\pi;\rho)$ as a function of $\|Q^* - \hat Q\|_{p,\rho^*}$, the action-gap distribution, and the concentrability coefficient.

Theorem 1. Consider an MDP $(\mathcal{X}, \mathcal{A}, P, \mathcal{R}, \gamma)$ with $|\mathcal{A}| = 2$ and an estimate $\hat Q$ of the optimal action-value function. Let Assumption A1 hold and $C(\rho,\rho^*) < \infty$. Denote by $\hat\pi$ the greedy policy w.r.t. $\hat Q$.
We then have

$$\mathrm{Loss}(\hat\pi;\rho) \leq \begin{cases} 2^{1+\zeta}\, c_g\, C(\rho,\rho^*)\, \big\| \hat Q - Q^* \big\|_\infty^{1+\zeta}, & (p = \infty) \\ 2^{1+\frac{p(1+\zeta)}{p+\zeta}}\, c_g^{\frac{p-1}{p+\zeta}}\, C(\rho,\rho^*)\, \big\| \hat Q - Q^* \big\|_{p,\rho^*}^{\frac{p(1+\zeta)}{p+\zeta}}. & (1 \leq p < \infty) \end{cases}$$

Proof. Let the function $F : \mathcal{X} \to \mathbb{R}$ be defined as $F(x) = V^*(x) - V^{\hat\pi}(x) = Q^{\pi^*}(x,\pi^*(x)) - Q^{\hat\pi}(x,\hat\pi(x))$ for any $x \in \mathcal{X}$. Note that $\mathrm{Loss}(\hat\pi;\rho) = \rho F$. Decompose $F(x)$ as

$$F(x) = \Big( Q^{\pi^*}(x,\pi^*(x)) - Q^{\pi^*}(x,\hat\pi(x)) \Big) + \Big( Q^{\pi^*}(x,\hat\pi(x)) - Q^{\hat\pi}(x,\hat\pi(x)) \Big) = F_1(x) + F_2(x).$$

We have

$$F_2(x) = \left[ r(x,\hat\pi(x)) + \gamma \int_{\mathcal{X}} P(dy|x,\hat\pi(x))\, Q^{\pi^*}(y,\pi^*(y)) \right] - \left[ r(x,\hat\pi(x)) + \gamma \int_{\mathcal{X}} P(dy|x,\hat\pi(x))\, Q^{\hat\pi}(y,\hat\pi(y)) \right] = \gamma P^{\hat\pi}(\cdot|x) F(\cdot).$$

Therefore, $F = (I - \gamma P^{\hat\pi})^{-1} F_1 = \sum_{m \geq 0} (\gamma P^{\hat\pi})^m F_1$. Thus,

$$\rho F = \sum_{m \geq 0} \gamma^m\, \rho (P^{\hat\pi})^m F_1 = \sum_{m \geq 0} \gamma^m \int_{\mathcal{X}} \big(\rho(P^{\hat\pi})^m\big)(dy)\, F_1(y) = \sum_{m \geq 0} \gamma^m \int_{\mathcal{X}} \frac{d\big(\rho(P^{\hat\pi})^m\big)}{d\rho^*}(y)\, F_1(y)\, d\rho^*(y) \leq \sum_{m \geq 0} \gamma^m c(m;\hat\pi)\, \rho^* F_1 \leq C(\rho,\rho^*)\, \rho^* F_1, \quad (2)$$

in which we used the Radon-Nikodym theorem, the nonnegativity of $F_1$, and the definition of the concentrability coefficient. Let us turn to $F_1$ and provide an upper bound for it. We use techniques similar to [5].

$L_\infty$ result: Note that for any given $x \in \mathcal{X}$, if for some value of $\varepsilon > 0$ we have $\hat\pi(x) \neq \pi^*(x)$ and $|Q^{\pi^*}(x,a) - \hat Q(x,a)| \leq \varepsilon$ (for both $a = 1, 2$), then it holds that $g_{Q^*}(x) = |Q^{\pi^*}(x,1) - Q^{\pi^*}(x,2)| \leq 2\varepsilon$. To show it, suppose instead that $g_{Q^*}(x) = |Q^{\pi^*}(x,1) - Q^{\pi^*}(x,2)| > 2\varepsilon$. Then because of the assumption $|Q^{\pi^*}(x,a) - \hat Q(x,a)| \leq \varepsilon$ ($a = 1, 2$), the ordering of $\hat Q(x,1)$ and $\hat Q(x,2)$ is the same as the ordering of $Q^*(x,1)$ and $Q^*(x,2)$, which contradicts the assumption that $\hat\pi(x) \neq \pi^*(x)$ (see Figure 2).

Denote $\varepsilon_0 = \|Q^{\pi^*} - \hat Q\|_\infty$. Whenever $\hat\pi(x) = \pi^*(x)$, the value of $F_1(x)$ is zero, so we get

$$F_1(x) = \Big[ Q^{\pi^*}(x,\pi^*(x)) - Q^{\pi^*}(x,\hat\pi(x)) \Big] \big[ \mathbb{I}\{\hat\pi(x) = \pi^*(x)\} + \mathbb{I}\{\hat\pi(x) \neq \pi^*(x)\} \big] = \Big[ Q^{\pi^*}(x,\pi^*(x)) - Q^{\pi^*}(x,\hat\pi(x)) \Big] \mathbb{I}\{\hat\pi(x) \neq \pi^*(x)\} \times \big[ \mathbb{I}\{g_{Q^*}(x) = 0\} + \mathbb{I}\{0 < g_{Q^*}(x) \leq 2\varepsilon_0\} + \mathbb{I}\{g_{Q^*}(x) > 2\varepsilon_0\} \big] \leq 0 + 2\varepsilon_0\, \mathbb{I}\{0 < g_{Q^*}(x) \leq 2\varepsilon_0\} + 0.$$

Here we used the definition of $g_{Q^*}(x)$ and the fact that, on the event $\{\hat\pi(x) \neq \pi^*(x)\}$, $g_{Q^*}(x)$ is no larger than $2\varepsilon_0$.
This result together with Assumption A1 shows that $\rho^* F_1 \leq 2\varepsilon_0\, \mathbb{P}_{\rho^*}(0 < g_{Q^*}(X) \leq 2\varepsilon_0) \leq 2\varepsilon_0\, c_g\, (2\varepsilon_0)^\zeta$. Plugging this result into (2) finishes the proof of the first part.

$L_p$ result: For any given $x \in \mathcal{X}$, let $D(x) = |Q^{\pi^*}(x,1) - \hat Q(x,1)| + |Q^{\pi^*}(x,2) - \hat Q(x,2)|$. Whenever $\hat\pi(x) \neq \pi^*(x)$, we have $g_{Q^*}(x) \leq D(x)$. Similar to the previous case, for any $t > 0$ we have

$$F_1(x) = \Big[ Q^{\pi^*}(x,\pi^*(x)) - Q^{\pi^*}(x,\hat\pi(x)) \Big] \mathbb{I}\{\hat\pi(x) \neq \pi^*(x)\} \times \big[ \mathbb{I}\{g_{Q^*}(x) = 0\} + \mathbb{I}\{0 < g_{Q^*}(x) \leq t\} + \mathbb{I}\{g_{Q^*}(x) > t\} \big] \leq D(x) \big[ \mathbb{I}\{0 < g_{Q^*}(x) \leq t\} + \mathbb{I}\{g_{Q^*}(x) > t\} \big].$$

Take expectations w.r.t. $\rho^*$ and use Hölder's inequality to get

$$\rho^* F_1 \leq \|D\|_{p,\rho^*} \big[ \mathbb{P}_{\rho^*}(0 < g_{Q^*}(X) \leq t) \big]^{\frac{p-1}{p}} + \|D\|_{p,\rho^*} \big[ \mathbb{P}_{\rho^*}(g_{Q^*}(X) > t) \big]^{\frac{p-1}{p}} \leq \|D\|_{p,\rho^*} \big( c_g t^\zeta \big)^{\frac{p-1}{p}} + \|D\|_{p,\rho^*} \big[ \mathbb{P}_{\rho^*}(D(X) > t) \big]^{\frac{p-1}{p}} \leq \|D\|_{p,\rho^*} \big( c_g t^\zeta \big)^{\frac{p-1}{p}} + \frac{\|D\|^p_{p,\rho^*}}{t^{p-1}},$$

where we used Assumption A1 and the definition of $D(\cdot)$ in the second inequality, and Markov's inequality in the last one. Minimize the upper bound in $t$ to get $t = c_g^{-\frac{1}{p+\zeta}}\, \|D\|_{p,\rho^*}^{\frac{p}{p+\zeta}}$. This leads to $\rho^* F_1 \leq 2\, c_g^{\frac{p-1}{p+\zeta}}\, \|D\|_{p,\rho^*}^{\frac{p(1+\zeta)}{p+\zeta}}$, which in turn, alongside inequality (2) and $\|D\|^p_{p,\rho^*} \leq 2^p \|Q^{\pi^*} - \hat Q\|^p_{p,\rho^*}$, proves the second part of this result.

This theorem indicates that if $\|\hat Q - Q^*\|_p$ ($1 < p \leq \infty$) has an error upper bound of $O(n^{-\beta})$ (with $\beta$ typically in the range $(0, 1/2]$, depending on the properties of the MDP and the estimator), we obtain faster convergence upper bounds on the performance loss $\mathrm{Loss}(\hat\pi;\rho)$ whenever the problem has an action-gap regularity ($\zeta > 0$).

One might compare Theorem 1 with classical upper bounds such as $\|V^{\hat\pi} - V^{\pi^*}\|_\infty \leq \frac{2\gamma}{1-\gamma} \|\hat V - V^*\|_\infty$ (Proposition 6.1 of Bertsekas and Tsitsiklis [2]). In order to make these two bounds comparable, we slightly modify the proof of our theorem to get the $L_\infty$-norm on the left-hand side. The result would be $\|V^* - V^{\hat\pi}\|_\infty \leq \frac{2^{1+\zeta} c_g}{1-\gamma} \|\hat Q - Q^*\|^{1+\zeta}_\infty$. If there is no action-gap assumption ($\zeta = 0$ and $c_g = 1$), the results are similar (except for a factor of $\gamma$ and that we measure the error by the difference in the action-value function as opposed to the value function), but when $\zeta > 0$ the error bound significantly improves.

4 Application of the Action-Gap Theorem in Approximate Value Iteration

The goal of this section is to show how the analysis based on the action-gap phenomenon might lead to a tighter upper bound on the performance loss for the family of AVI algorithms. There are various AVI algorithms (Riedmiller [19], Ernst et al. [20], Munos and Szepesvári [11], Farahmand et al.
[12]) that work by generating a sequence of action-value function estimates $(\hat Q_k)_{k=0}^K$, in which $\hat Q_{k+1}$ is the result of approximately applying the Bellman optimality operator to the previous estimate $\hat Q_k$, i.e., $\hat Q_{k+1} \approx T^* \hat Q_k$. Let us denote the error caused at each iteration by

$$\varepsilon_k \triangleq T^* \hat Q_k - \hat Q_{k+1}. \quad (3)$$

The following theorem, which is based on Theorem 3 of Farahmand et al. [25], relates the performance loss $\|Q^{\hat\pi(\cdot;\hat Q_K)} - Q^*\|_{1,\rho}$ of the obtained greedy policy $\hat\pi(\cdot;\hat Q_K)$ to the error sequence $(\varepsilon_k)_{k=0}^{K-1}$ and the action-gap assumption on the MDP. Before stating the theorem, we define the following sequence:

$$\alpha_k = \begin{cases} \frac{1-\gamma}{1-\gamma^{K+1}}\, \gamma^{K-k-1}, & 0 \leq k < K, \\ \frac{1-\gamma}{1-\gamma^{K+1}}\, \gamma^{K}, & k = K. \end{cases}$$

This sequence has $\alpha_k \propto \gamma^{K-k}$ behavior and satisfies $\sum_{k=0}^K \alpha_k = 1$.

Theorem 2 (Error Propagation for AVI). Consider an MDP $(\mathcal{X}, \mathcal{A}, P, \mathcal{R}, \gamma)$ with $|\mathcal{A}| = 2$ that satisfies Assumption A1 and has $C(\rho,\rho^*) < \infty$. Let $p \geq 1$ be a real number and $K$ be a positive integer. Then for any sequence $(\hat Q_k)_{k=0}^K \subset B(\mathcal{X}\times\mathcal{A}, Q_{\max})$ and the corresponding sequence $(\varepsilon_k)_{k=0}^{K-1}$ defined in (3), we have

$$\mathrm{Loss}(\hat\pi(\cdot;\hat Q_K);\rho) \leq 2\, c_g^{\frac{p-1}{p+\zeta}}\, C(\rho,\rho^*) \left( \frac{2}{1-\gamma} \right)^{\frac{p(1+\zeta)}{p+\zeta}} \left[ \sum_{k=0}^{K-1} \alpha_k \|\varepsilon_k\|^p_{p,\rho^*} + \alpha_K (2 Q_{\max})^p \right]^{\frac{1+\zeta}{p+\zeta}}.$$

Proof.
Similar to Lemma 4.1 of Munos [24], one may derive

$$Q^* - \hat Q_{k+1} = T^{\pi^*} Q^* - T^{\pi^*} \hat Q_k + T^{\pi^*} \hat Q_k - T^* \hat Q_k + \varepsilon_k \leq \gamma P^{\pi^*} (Q^* - \hat Q_k) + \varepsilon_k,$$

where we used the property of the Bellman optimality operator $T^* \hat Q_k \geq T^{\pi^*} \hat Q_k$ and the definition of $\varepsilon_k$ (3). By induction, we get

$$Q^* - \hat Q_K \leq \sum_{k=0}^{K-1} \gamma^{K-k-1} (P^{\pi^*})^{K-k-1} \varepsilon_k + \gamma^K (P^{\pi^*})^K (Q^* - \hat Q_0).$$

Therefore, for any $p \geq 1$, the value of $\|Q^* - \hat Q_K\|^p_{p,\rho^*} = \rho^* |Q^* - \hat Q_K|^p$ is upper bounded by

$$\rho^* |Q^* - \hat Q_K|^p \leq \left( \frac{1-\gamma^{K+1}}{1-\gamma} \right)^p \rho^* \left[ \sum_{k=0}^{K-1} \alpha_k (P^{\pi^*})^{K-k-1} |\varepsilon_k| + \alpha_K (P^{\pi^*})^K |Q^* - \hat Q_0| \right]^p \leq \left( \frac{1-\gamma^{K+1}}{1-\gamma} \right)^p \left[ \sum_{k=0}^{K-1} \alpha_k \|\varepsilon_k\|^p_{p,\rho^*} + \alpha_K (2 Q_{\max})^p \right],$$

where we used $\rho^*(P^{\pi^*})^m = \rho^*$ (for any $m \geq 0$) and Jensen's inequality. The application of Theorem 1, together with the observation that $(1-\gamma^{K+1})/(1-\gamma) \leq 1/(1-\gamma)$, leads to the desired result.

Comparing this theorem with Theorem 3 of Farahmand et al. [25] is instructive. Denoting $E = \sum_{k=0}^{K-1} \alpha_k \|\varepsilon_k\|^2_{2,\rho^*}$, this paper's result indicates that the effect of the size of $\varepsilon_k$ on $\mathrm{Loss}(\hat\pi(\cdot;\hat Q_K);\rho)$ depends on $E^{\frac{1+\zeta}{2+\zeta}}$, while [25], which does not consider the action-gap regularity, suggests that the effect depends on $E^{1/2}$.
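The improvement can be made concrete with a few illustrative numbers: for $p = 2$ the bound scales as $E^{(1+\zeta)/(2+\zeta)}$, against $E^{1/2}$ for the action-gap-free analysis. A small sketch (the values of $E$ and $\zeta$ are illustrative only):

```python
def loss_scale(E, zeta):
    """Scaling of the AVI performance-loss bound for p = 2: E^((1+zeta)/(2+zeta)).
    zeta = 0 recovers the classical E^(1/2) dependence."""
    return E ** ((1 + zeta) / (2 + zeta))

E = 1e-4  # aggregate per-iteration error (illustrative)
for zeta in (0.0, 1.0, 2.0):
    print(zeta, loss_scale(E, zeta))
```

For $E < 1$, a larger $\zeta$ yields a strictly smaller bound, e.g., $E^{3/4}$ instead of $E^{1/2}$ when $\zeta = 2$.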
For $\zeta > 0$, this indicates a faster convergence rate for the performance loss, while for $\zeta = 0$ they are the same.

5 Conclusion

This work introduced the action-gap regularity in reinforcement learning and planning problems and analyzed the action-gap phenomenon for two-action discounted MDPs. We showed that when the problem has a favorable action-gap regularity, quantified by the parameter $\zeta$, the performance loss is much smaller than the error of the estimated optimal action-value function. The action-gap regularity, among other regularities such as the smoothness of the action-value function [13], is a step toward a better understanding of which properties of a sequential decision-making problem make learning and planning easy or difficult.

There are several issues that deserve to be studied in the future. Among them is the extension of the current framework to multi-action discounted MDPs. It is also important to study the relation between the parameter $\zeta$ of the action-gap regularity assumption and the properties of the MDP, such as the transition probability kernel and the reward distribution.

Acknowledgments

I thank the anonymous reviewers for their useful comments. This work was partly supported by AICML and NSERC.

References

[1] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[2] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3). Athena Scientific, 1996.

[3] Enno Mammen and Alexander B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[4] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[5] Jean-Yves Audibert and Alexander B. Tsybakov. Fast learning rates for plug-in classifiers. The
The Annals of Statistics, 35(2):608–633, 2007.
[6] Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.
[7] Michail G. Lagoudakis and Ronald Parr. Reinforcement learning as classification: Leveraging modern classifiers. In ICML '03: Proceedings of the 20th International Conference on Machine Learning, pages 424–431, 2003.
[8] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of a classification-based policy iteration algorithm. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 607–614. Omnipress, 2010.
[9] Umar Syed and Robert E. Schapire. A reduction from apprenticeship learning to classification. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS-23), pages 2253–2261, 2010.
[10] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.
[11] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
[12] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of the American Control Conference (ACC), pages 725–730, June 2009.
[13] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS-21), pages 441–448.
MIT Press, 2009.
[14] Odalric Maillard, Rémi Munos, Alessandro Lazaric, and Mohammad Ghavamzadeh. Finite-sample analysis of Bellman residual minimization. In Proceedings of the Second Asian Conference on Machine Learning (ACML), 2010.
[15] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, 1998.
[16] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.
[17] Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward off-policy learning control with function approximation. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 719–726, Haifa, Israel, June 2010. Omnipress.
[18] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528. ACM, 2009.
[19] Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[20] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[21] Daniela Pucci de Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
[22] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 809–816, New York, NY, USA, 2009. ACM.
[23] Rémi Munos.
Error bounds for approximate policy iteration. In ICML 2003: Proceedings of the 20th Annual International Conference on Machine Learning, pages 560–567, 2003.
[24] Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, pages 541–561, 2007.
[25] Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS-23), pages 568–576, 2010.