{"title": "On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1826, "page_last": 1834, "abstract": "We consider infinite-horizon stationary $\\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\\epsilon$ at each iteration, it is well-known that one can compute stationary policies that are $\\frac{2\\gamma}{(1-\\gamma)^2}\\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\\frac{2\\gamma}{1-\\gamma}\\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\\gamma$ is close to $1$. Surprisingly, this shows that the problem of ``computing near-optimal non-stationary policies'' is much simpler than that of ``computing near-optimal stationary policies''.", "full_text": "On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes\n\nBruno Scherrer\n\nInria, Villers-l\u00e8s-Nancy, F-54600, France\n\nbruno.scherrer@inria.fr\n\nBoris Lesner\n\nInria, Villers-l\u00e8s-Nancy, F-54600, France\n\nboris.lesner@inria.fr\n\nAbstract\n\nWe consider infinite-horizon stationary \u03b3-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error \u03b5 at each iteration, it is well-known that one can compute stationary policies that are $\\frac{2\\gamma}{(1-\\gamma)^2}\\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\\frac{2\\gamma}{1-\\gamma}\\epsilon$-optimal, which constitutes a significant improvement in the usual situation when \u03b3 is close to 1. Surprisingly, this shows that the problem of \u201ccomputing near-optimal non-stationary policies\u201d is much simpler than that of \u201ccomputing near-optimal stationary policies\u201d.\n\n1 Introduction\n\nGiven an infinite-horizon stationary \u03b3-discounted Markov Decision Process [24, 4], we consider approximate versions of the standard Dynamic Programming algorithms, Policy and Value Iteration, that build sequences of value functions vk and policies \u03c0k as follows:\n\nApproximate Value Iteration (AVI):\n\n$v_{k+1} \\leftarrow T v_k + \\epsilon_{k+1}$ (1)\n\nApproximate Policy Iteration (API):\n\n$v_k \\leftarrow v_{\\pi_k} + \\epsilon_k$\n$\\pi_{k+1} \\leftarrow$ any element of $G(v_k)$ (2)\n\nwhere v0 and \u03c00 are arbitrary, T is the Bellman optimality operator, v\u03c0k is the value of policy \u03c0k and G(vk) is the set of policies that are greedy with respect to vk. At each iteration k, the term \u03b5k accounts for a possible approximation of the Bellman operator (for AVI) or the evaluation of v\u03c0k (for API). Throughout the paper, we will assume that the error terms \u03b5k satisfy, for all k, \u2016\u03b5k\u2016\u221e \u2264 \u03b5 for some \u03b5 \u2265 0. Under this assumption, it is well-known that both algorithms share the following performance bound (see [25, 11, 4] for AVI and [4] for API):\n\nTheorem 1. For API (resp. AVI), the loss due to running policy \u03c0k (resp. any policy \u03c0k in G(vk\u22121)) instead of the optimal policy \u03c0\u2217 satisfies\n\n$\\limsup_{k \\to \\infty} \\|v_* - v_{\\pi_k}\\|_\\infty \\le \\frac{2\\gamma}{(1-\\gamma)^2}\\epsilon.$\n\nThe constant $\\frac{2\\gamma}{(1-\\gamma)^2}$ can be very big, in particular when \u03b3 is close to 1, and consequently the above bound is commonly believed to be conservative for practical applications. Interestingly, this very constant $\\frac{2\\gamma}{(1-\\gamma)^2}$ appears in many works analyzing AVI algorithms [25, 11, 27, 12, 13, 23, 7, 6, 20, 21, 22, 9], API algorithms [15, 19, 16, 1, 8, 18, 5, 17, 10, 3, 9, 2] and in one of their generalizations [26], suggesting that it cannot be improved. Indeed, the bound (and the $\\frac{2\\gamma}{(1-\\gamma)^2}$ constant) are tight for API [4, Example 6.4], and we will show in Section 3 \u2013 to our knowledge, this has never been argued in the literature \u2013 that it is also tight for AVI.\n\nEven though the theory of optimal control states that there exists a stationary policy that is optimal, the main contribution of our paper is to show that looking for a non-stationary policy (instead of a stationary one) may lead to a much better performance bound. In Section 4, we will show how to deduce such a non-stationary policy from a run of AVI. In Section 5, we will describe two original policy iteration variations that compute non-stationary policies. For all these algorithms, we will prove that we have a performance bound that can be reduced down to $\\frac{2\\gamma}{1-\\gamma}\\epsilon$. This is a factor $\\frac{1}{1-\\gamma}$ better than the standard bound of Theorem 1, which is significant when \u03b3 is close to 1. Surprisingly, this will show that the problem of \u201ccomputing near-optimal non-stationary policies\u201d is much simpler than that of \u201ccomputing near-optimal stationary policies\u201d. Before we present these contributions, the next section begins by precisely describing our setting.\n\n2 Background\n\nWe consider an infinite-horizon discounted Markov Decision Process [24, 4] (S, A, P, r, \u03b3), where S is a possibly infinite state space, A is a finite action space, P(ds\u2032|s, a), for all (s, a), is a probability kernel on S, r : S \u00d7 A \u2192 R is a reward function bounded in max-norm by Rmax, and \u03b3 \u2208 (0, 1) is a discount factor. 
A stationary deterministic policy \u03c0 : S \u2192 A maps states to actions. We write r\u03c0(s) = r(s, \u03c0(s)) and P\u03c0(ds\u2032|s) = P(ds\u2032|s, \u03c0(s)) for the immediate reward and the stochastic kernel associated to policy \u03c0. The value v\u03c0 of a policy \u03c0 is a function mapping states to the expected discounted sum of rewards received when following \u03c0 from any state: for all s \u2208 S,\n\n$v_\\pi(s) = \\mathbb{E}\\left[ \\sum_{t=0}^{\\infty} \\gamma^t r_\\pi(s_t) \\,\\middle|\\, s_0 = s,\\ s_{t+1} \\sim P_\\pi(\\cdot|s_t) \\right].$\n\nThe value v\u03c0 is clearly bounded by Vmax = Rmax/(1 \u2212 \u03b3). It is well-known that v\u03c0 can be characterized as the unique fixed point of the linear Bellman operator associated to a policy \u03c0: T\u03c0 : v \u21a6 r\u03c0 + \u03b3P\u03c0v. Similarly, the Bellman optimality operator T : v \u21a6 max\u03c0 T\u03c0v has as unique fixed point the optimal value v\u2217 = max\u03c0 v\u03c0. A policy \u03c0 is greedy w.r.t. a value function v if T\u03c0v = Tv; the set of such greedy policies is written G(v). Finally, a policy \u03c0\u2217 is optimal, with value v\u03c0\u2217 = v\u2217, iff \u03c0\u2217 \u2208 G(v\u2217), or equivalently T\u03c0\u2217v\u2217 = v\u2217.\n\nThough it is known [24, 4] that there always exists a deterministic stationary policy that is optimal, we will, in this article, consider non-stationary policies and now introduce related notations. Given a sequence \u03c01, \u03c02, . . . , \u03c0k of k stationary policies (this sequence will be clear in the context we describe later), and for any 1 \u2264 m \u2264 k, we will denote by \u03c0k,m the periodic non-stationary policy that takes the first action according to \u03c0k, the second according to \u03c0k\u22121, . . . , the mth according to \u03c0k\u2212m+1 and then starts again. 
Formally, this can be written as\n\n\u03c0k,m = \u03c0k \u03c0k\u22121 \u00b7\u00b7\u00b7 \u03c0k\u2212m+1 \u03c0k \u03c0k\u22121 \u00b7\u00b7\u00b7 \u03c0k\u2212m+1 \u00b7\u00b7\u00b7\n\nIt is straightforward to show that the value v\u03c0k,m of this periodic non-stationary policy \u03c0k,m is the unique fixed point of the following operator:\n\nTk,m = T\u03c0k T\u03c0k\u22121 \u00b7\u00b7\u00b7 T\u03c0k\u2212m+1.\n\nFinally, it will be convenient to introduce the following discounted kernel:\n\n\u0393k,m = (\u03b3P\u03c0k)(\u03b3P\u03c0k\u22121) \u00b7\u00b7\u00b7 (\u03b3P\u03c0k\u2212m+1).\n\nIn particular, for any pair of values v and v\u2032, it can easily be seen that Tk,mv \u2212 Tk,mv\u2032 = \u0393k,m(v \u2212 v\u2032).\n\n3 Tightness of the performance bound of Theorem 1\n\nThe bound of Theorem 1 is tight for API in the sense that there exists an MDP [4, Example 6.4] for which the bound is reached. To the best of our knowledge, a similar argument has never been provided for AVI in the literature. It turns out that the MDP that is used for showing the tightness for API also applies to AVI. This is what we show in this section.\n\nFigure 1: The deterministic MDP for which the bound of Theorem 1 is tight for Value and Policy Iteration. (States 1, 2, 3, . . . , k, . . . ; every move action has reward 0, and the stay actions in states 1, 2, 3, . . . , k have rewards 0, \u22122\u03b3\u03b5, \u22122(\u03b3 + \u03b3^2)\u03b5, . . . , $-2\\frac{\\gamma-\\gamma^k}{1-\\gamma}\\epsilon$.)\n\nExample 1. Consider the \u03b3-discounted deterministic MDP from [4, Example 6.4] depicted in Figure 1. It involves states 1, 2, . . . . In state 1 there is only one self-loop action with zero reward; for each state i > 1 there are two possible choices: either move to state i \u2212 1 with zero reward or stay with reward $r_i = -2\\frac{\\gamma-\\gamma^i}{1-\\gamma}\\epsilon$ with \u03b5 \u2265 0. 
Clearly the optimal policy in all states i > 1 is to move to i \u2212 1 and the optimal value function v\u2217 is 0 in all states.\n\nStarting with v0 = v\u2217, we are going to show that for all iterations k \u2265 1 it is possible to have a policy \u03c0k+1 \u2208 G(vk) which moves in every state but k + 1 and thus is such that $v_{\\pi_{k+1}}(k+1) = \\frac{r_{k+1}}{1-\\gamma} = -2\\frac{\\gamma-\\gamma^{k+1}}{(1-\\gamma)^2}\\epsilon$, which meets the bound of Theorem 1 when k tends to infinity.\n\nTo do so, we assume that the following approximation errors are made at each iteration k > 0:\n\n$\\epsilon_k(i) = \\begin{cases} -\\epsilon & \\text{if } i = k \\\\ \\epsilon & \\text{if } i = k+1 \\\\ 0 & \\text{otherwise.} \\end{cases}$\n\nWith this error, we are now going to prove by induction on k that for all k \u2265 1,\n\n$v_k(i) = \\begin{cases} -\\gamma^{k-1}\\epsilon & \\text{if } i < k \\\\ r_k/2 - \\epsilon & \\text{if } i = k \\\\ -(r_k/2 - \\epsilon) & \\text{if } i = k+1 \\\\ 0 & \\text{otherwise.} \\end{cases}$\n\nSince v0 = 0 the best action is clearly to move in every state i \u2265 2, which gives v1 = v0 + \u03b51 = \u03b51 and establishes the claim for k = 1.\n\nAssuming that our induction claim holds for k, we now show that it also holds for k + 1.\n\nFor the move action, write $q^m_k$ for its action-value function. For all i > 1 we have $q^m_k(i) = 0 + \\gamma v_k(i-1)$, hence\n\n$q^m_k(i) = \\begin{cases} \\gamma(-\\gamma^{k-1}\\epsilon) = -\\gamma^k\\epsilon & \\text{if } i = 2, \\dots, k \\\\ \\gamma(r_k/2 - \\epsilon) = r_{k+1}/2 & \\text{if } i = k+1 \\\\ -\\gamma(r_k/2 - \\epsilon) = -r_{k+1}/2 & \\text{if } i = k+2 \\\\ 0 & \\text{otherwise.} \\end{cases}$\n\nFor the stay action, write $q^s_k$ for its action-value function. For all i > 0 we have $q^s_k(i) = r_i + \\gamma v_k(i)$, hence\n\n$q^s_k(i) = \\begin{cases} r_i + \\gamma(-\\gamma^{k-1}\\epsilon) = r_i - \\gamma^k\\epsilon & \\text{if } i = 1, \\dots, k-1 \\\\ r_k + \\gamma(r_k/2 - \\epsilon) = r_k + r_{k+1}/2 & \\text{if } i = k \\\\ r_{k+1} - r_{k+1}/2 = r_{k+1}/2 & \\text{if } i = k+1 \\\\ r_{k+2} + \\gamma 0 = r_{k+2} & \\text{if } i = k+2 \\\\ r_i + \\gamma 0 = r_i & \\text{otherwise.} \\end{cases}$\n\nFirst, only the stay action is available in state 1, hence, since r1 = 0 and \u03b5k+1(1) = 0, we have $v_{k+1}(1) = q^s_k(1) + \\epsilon_{k+1}(1) = -\\gamma^k\\epsilon$, as desired. Second, since ri < 0 for all i > 1 we have $q^m_k(i) > q^s_k(i)$ for all these states but k + 1, where $q^m_k(k+1) = q^s_k(k+1) = r_{k+1}/2$. Using the fact that $v_{k+1} = \\max(q^m_k, q^s_k) + \\epsilon_{k+1}$ gives the result for vk+1.\n\nThe fact that for i > 1 we have $q^m_k(i) \\ge q^s_k(i)$ with equality only at i = k + 1 implies that there exists a policy \u03c0k+1 greedy for vk which takes the optimal move action in all states but k + 1, where the stay action has the same value. This leaves the algorithm the possibility of choosing the suboptimal stay action in this state, yielding a value v\u03c0k+1(k + 1) that matches the upper bound as k goes to infinity.\n\nSince Example 1 shows that the bound of Theorem 1 is tight, improving performance bounds requires modifying the algorithms. The following sections of the paper show that considering non-stationary policies instead of stationary policies is an interesting path to follow.\n\n4 Deducing a non-stationary policy from AVI\n\nWhile AVI (Equation (1)) is usually considered as generating a sequence of values v0, v1, . . . , vk\u22121, it also implicitly produces a sequence\u00b9 of policies \u03c01, \u03c02, . . . , \u03c0k, where for i = 0, . . . , k \u2212 1, \u03c0i+1 \u2208 G(vi). Instead of outputting only the last policy \u03c0k, we here simply propose to output the periodic non-stationary policy \u03c0k,m that loops over the last m generated policies. The following theorem shows that it is indeed a good idea.\n\nTheorem 2. 
For all iterations k and all m such that 1 \u2264 m \u2264 k, the loss of running the non-stationary policy \u03c0k,m instead of the optimal policy \u03c0\u2217 satisfies:\n\n$\\|v_* - v_{\\pi_{k,m}}\\|_\\infty \\le \\frac{2}{1-\\gamma^m} \\left( \\frac{\\gamma-\\gamma^k}{1-\\gamma}\\epsilon + \\gamma^k \\|v_* - v_0\\|_\\infty \\right).$\n\nWhen m = 1 and k tends to infinity, one exactly recovers the result of Theorem 1. For general m, this new bound is a factor $\\frac{1-\\gamma^m}{1-\\gamma}$ better than the standard bound of Theorem 1. The choice that optimizes the bound, m = k, which consists in looping over all the policies generated from the very start, leads to the following bound:\n\n$\\|v_* - v_{\\pi_{k,k}}\\|_\\infty \\le 2\\left( \\frac{\\gamma}{1-\\gamma} - \\frac{\\gamma^k}{1-\\gamma^k} \\right)\\epsilon + \\frac{2\\gamma^k}{1-\\gamma^k} \\|v_* - v_0\\|_\\infty,$\n\nthat tends to $\\frac{2\\gamma}{1-\\gamma}\\epsilon$ when k tends to \u221e.\n\nThe rest of the section is devoted to the proof of Theorem 2. An important step of our proof lies in the following lemma, which implies that for sufficiently big m, vk = T vk\u22121 + \u03b5k is a rather good approximation (of the order $\\frac{\\epsilon}{1-\\gamma}$) of the value v\u03c0k,m of the non-stationary policy \u03c0k,m (whereas in general, it is a much poorer approximation of the value v\u03c0k of the last stationary policy \u03c0k).\n\nLemma 1. For all m and k such that 1 \u2264 m \u2264 k,\n\n$\\|T v_{k-1} - v_{\\pi_{k,m}}\\|_\\infty \\le \\gamma^m \\|v_{k-m} - v_{\\pi_{k,m}}\\|_\\infty + \\frac{\\gamma-\\gamma^m}{1-\\gamma}\\epsilon.$\n\nProof of Lemma 1. 
The value of \u03c0k,m satisfies:\n\n$v_{\\pi_{k,m}} = T_{\\pi_k} T_{\\pi_{k-1}} \\cdots T_{\\pi_{k-m+1}} v_{\\pi_{k,m}}.$ (3)\n\nBy induction, it can be shown that the sequence of values generated by AVI satisfies:\n\n$T_{\\pi_k} v_{k-1} = T_{\\pi_k} T_{\\pi_{k-1}} \\cdots T_{\\pi_{k-m+1}} v_{k-m} + \\sum_{i=1}^{m-1} \\Gamma_{k,i}\\, \\epsilon_{k-i}.$ (4)\n\nBy subtracting Equation (3) from Equation (4), one obtains:\n\n$T v_{k-1} - v_{\\pi_{k,m}} = T_{\\pi_k} v_{k-1} - v_{\\pi_{k,m}} = \\Gamma_{k,m}(v_{k-m} - v_{\\pi_{k,m}}) + \\sum_{i=1}^{m-1} \\Gamma_{k,i}\\, \\epsilon_{k-i}$\n\nand the result follows by taking the norm and using the fact that for all i, \u2016\u0393k,i\u2016\u221e = \u03b3^i.\n\nWe are now ready to prove the main result of this section.\n\nProof of Theorem 2. Using the fact that T is a contraction in max-norm, we have:\n\n$\\begin{aligned} \\|v_* - v_k\\|_\\infty &= \\|v_* - T v_{k-1} - \\epsilon_k\\|_\\infty \\\\ &\\le \\|T v_* - T v_{k-1}\\|_\\infty + \\epsilon \\\\ &\\le \\gamma \\|v_* - v_{k-1}\\|_\\infty + \\epsilon. \\end{aligned}$\n\nThen, by induction on k, we have that for all k \u2265 1,\n\n$\\|v_* - v_k\\|_\\infty \\le \\gamma^k \\|v_* - v_0\\|_\\infty + \\frac{1-\\gamma^k}{1-\\gamma}\\epsilon.$ (5)\n\nUsing Lemma 1 and Equation (5) twice, we can conclude by observing that\n\n$\\begin{aligned} \\|v_* - v_{\\pi_{k,m}}\\|_\\infty &\\le \\|T v_* - T v_{k-1}\\|_\\infty + \\|T v_{k-1} - v_{\\pi_{k,m}}\\|_\\infty \\\\ &\\le \\gamma \\|v_* - v_{k-1}\\|_\\infty + \\gamma^m \\|v_{k-m} - v_{\\pi_{k,m}}\\|_\\infty + \\frac{\\gamma-\\gamma^m}{1-\\gamma}\\epsilon \\\\ &\\le \\gamma \\left( \\gamma^{k-1} \\|v_* - v_0\\|_\\infty + \\frac{1-\\gamma^{k-1}}{1-\\gamma}\\epsilon \\right) + \\gamma^m \\left( \\|v_{k-m} - v_*\\|_\\infty + \\|v_* - v_{\\pi_{k,m}}\\|_\\infty \\right) + \\frac{\\gamma-\\gamma^m}{1-\\gamma}\\epsilon \\\\ &\\le \\gamma^k \\|v_* - v_0\\|_\\infty + \\frac{\\gamma-\\gamma^k}{1-\\gamma}\\epsilon + \\gamma^m \\left( \\gamma^{k-m} \\|v_* - v_0\\|_\\infty + \\frac{1-\\gamma^{k-m}}{1-\\gamma}\\epsilon + \\|v_* - v_{\\pi_{k,m}}\\|_\\infty \\right) + \\frac{\\gamma-\\gamma^m}{1-\\gamma}\\epsilon \\\\ &= \\gamma^m \\|v_* - v_{\\pi_{k,m}}\\|_\\infty + 2\\gamma^k \\|v_* - v_0\\|_\\infty + \\frac{2(\\gamma-\\gamma^k)}{1-\\gamma}\\epsilon. \\end{aligned}$\n\nMoving the term \u03b3^m \u2016v\u2217 \u2212 v\u03c0k,m\u2016\u221e to the left-hand side and dividing by 1 \u2212 \u03b3^m gives\n\n$\\|v_* - v_{\\pi_{k,m}}\\|_\\infty \\le \\frac{2}{1-\\gamma^m} \\left( \\frac{\\gamma-\\gamma^k}{1-\\gamma}\\epsilon + \\gamma^k \\|v_* - v_0\\|_\\infty \\right).$\n\n\u00b9 A given sequence of value functions may induce many sequences of policies since more than one greedy policy may exist for one particular value function. Our results hold for all such possible choices of greedy policies.\n\n5 API algorithms for computing non-stationary policies\n\nWe now present similar results that have a Policy Iteration flavour. 
Unlike in the previous section where only the output of AVI needed to be changed, improving the bound for an API-like algorithm is slightly more involved. In this section, we describe and analyze two API algorithms that output non-stationary policies with improved performance bounds.\n\nAPI with a non-stationary policy of growing period. Following our findings on non-stationary policies for AVI, we consider the following variation of API, where at each iteration, instead of computing the value of the last stationary policy \u03c0k, we compute that of the periodic non-stationary policy \u03c0k,k that loops over all the policies \u03c01, . . . , \u03c0k generated from the very start:\n\nvk \u2190 v\u03c0k,k + \u03b5k\n\n\u03c0k+1 \u2190 any element of G(vk)\n\nwhere the initial (stationary) policy \u03c01,1 is chosen arbitrarily. Thus, iteration after iteration, the non-stationary policy \u03c0k,k is made of more and more stationary policies, and this is why we refer to it as having a growing period. We can prove the following performance bound for this algorithm:\n\nTheorem 3. After k iterations, the loss of running the non-stationary policy \u03c0k,k instead of the optimal policy \u03c0\u2217 satisfies:\n\n$\\|v_* - v_{\\pi_{k,k}}\\|_\\infty \\le \\frac{2(\\gamma-\\gamma^k)}{1-\\gamma}\\epsilon + \\gamma^{k-1} \\|v_* - v_{\\pi_{1,1}}\\|_\\infty + 2(k-1)\\gamma^k V_{\\max}.$\n\nWhen k tends to infinity, this bound tends to $\\frac{2\\gamma}{1-\\gamma}\\epsilon$, and is thus again a factor $\\frac{1}{1-\\gamma}$ better than the original API bound.\n\nProof of Theorem 3. 
Using the facts that Tk+1,k+1v\u03c0k,k = T\u03c0k+1Tk,kv\u03c0k,k = T\u03c0k+1v\u03c0k,k and T\u03c0k+1vk \u2265 T\u03c0\u2217vk (since \u03c0k+1 \u2208 G(vk)), we have:\n\nv\u2217 \u2212 v\u03c0k+1,k+1\n= T\u03c0\u2217v\u2217 \u2212 Tk+1,k+1v\u03c0k+1,k+1\n= T\u03c0\u2217v\u2217 \u2212 T\u03c0\u2217v\u03c0k,k + T\u03c0\u2217v\u03c0k,k \u2212 Tk+1,k+1v\u03c0k,k + Tk+1,k+1v\u03c0k,k \u2212 Tk+1,k+1v\u03c0k+1,k+1\n= \u03b3P\u03c0\u2217(v\u2217 \u2212 v\u03c0k,k) + T\u03c0\u2217v\u03c0k,k \u2212 T\u03c0k+1v\u03c0k,k + \u0393k+1,k+1(v\u03c0k,k \u2212 v\u03c0k+1,k+1)\n= \u03b3P\u03c0\u2217(v\u2217 \u2212 v\u03c0k,k) + T\u03c0\u2217vk \u2212 T\u03c0k+1vk + \u03b3(P\u03c0k+1 \u2212 P\u03c0\u2217)\u03b5k + \u0393k+1,k+1(v\u03c0k,k \u2212 v\u03c0k+1,k+1)\n\u2264 \u03b3P\u03c0\u2217(v\u2217 \u2212 v\u03c0k,k) + \u03b3(P\u03c0k+1 \u2212 P\u03c0\u2217)\u03b5k + \u0393k+1,k+1(v\u03c0k,k \u2212 v\u03c0k+1,k+1).\n\nBy taking the norm, and using the facts that \u2016v\u03c0k,k\u2016\u221e \u2264 Vmax, \u2016v\u03c0k+1,k+1\u2016\u221e \u2264 Vmax, and \u2016\u0393k+1,k+1\u2016\u221e = \u03b3^(k+1), we get:\n\n$\\|v_* - v_{\\pi_{k+1,k+1}}\\|_\\infty \\le \\gamma \\|v_* - v_{\\pi_{k,k}}\\|_\\infty + 2\\gamma\\epsilon + 2\\gamma^{k+1} V_{\\max}.$\n\nFinally, by induction on k, we obtain:\n\n$\\|v_* - v_{\\pi_{k,k}}\\|_\\infty \\le \\frac{2(\\gamma-\\gamma^k)}{1-\\gamma}\\epsilon + \\gamma^{k-1} \\|v_* - v_{\\pi_{1,1}}\\|_\\infty + 2(k-1)\\gamma^k V_{\\max}.$\n\nThough it has an improved asymptotic performance bound, the API algorithm we have just described has two (related) drawbacks: 1) its finite iteration bound has a somewhat unsatisfactory term of the form 2(k \u2212 1)\u03b3^k Vmax, and 2) even when there is no error (when \u03b5 = 0), we cannot guarantee that, similarly to standard Policy Iteration, it generates a sequence of policies of increasing values (it is easy to see that in general, we do not have v\u03c0k+1,k+1 \u2265 v\u03c0k,k). These two points motivate the introduction of another API algorithm.\n\nAPI with a non-stationary policy of fixed period. We consider now another variation of API parameterized by m \u2265 1, that iterates as follows for k \u2265 m:\n\nvk \u2190 v\u03c0k,m + \u03b5k\n\n\u03c0k+1 \u2190 any element of G(vk)\n\nwhere the initial non-stationary policy \u03c0m,m is built from a sequence of m arbitrary stationary policies \u03c01, \u03c02, \u00b7\u00b7\u00b7 , \u03c0m. Unlike the previous API algorithm, the non-stationary policy \u03c0k,m here only involves the last m greedy stationary policies instead of all of them, and is thus of fixed period. This is a strict generalization of the standard API algorithm, with which it coincides when m = 1. For this algorithm, we can prove the following performance bound:\n\nTheorem 4. For all m, for all k \u2265 m, the loss of running the non-stationary policy \u03c0k,m instead of the optimal policy \u03c0\u2217 satisfies:\n\n$\\|v_* - v_{\\pi_{k,m}}\\|_\\infty \\le \\gamma^{k-m} \\|v_* - v_{\\pi_{m,m}}\\|_\\infty + \\frac{2(\\gamma-\\gamma^{k+1-m})}{(1-\\gamma)(1-\\gamma^m)}\\epsilon.$\n\nWhen m = 1 and k tends to infinity, we recover exactly the bound of Theorem 1. When m > 1 and k tends to infinity, this bound coincides with that of Theorem 2 for our non-stationary version of AVI: it is a factor $\\frac{1-\\gamma^m}{1-\\gamma}$ better than the standard bound of Theorem 1.\n\nThe rest of this section develops the proof of this performance bound. A central argument of our proof is the following lemma, which shows that similarly to the standard API, our new algorithm has an (approximate) policy improvement property.\n\nLemma 2. At each iteration of the algorithm, the value v\u03c0k+1,m of the non-stationary policy\n\n\u03c0k+1,m = \u03c0k+1 \u03c0k . . . \u03c0k+2\u2212m \u03c0k+1 \u03c0k . . . \u03c0k\u2212m+2 . . 
.\n\ncannot be much worse than the value v\u03c0\u2032k,m of the non-stationary policy\n\n\u03c0\u2032k,m = \u03c0k\u2212m+1 \u03c0k . . . \u03c0k+2\u2212m \u03c0k\u2212m+1 \u03c0k . . . \u03c0k\u2212m+2 . . .\n\nin the precise following sense:\n\n$v_{\\pi_{k+1,m}} \\ge v_{\\pi'_{k,m}} - \\frac{2\\gamma}{1-\\gamma^m}\\epsilon.$\n\nThe policy \u03c0\u2032k,m differs from \u03c0k+1,m in that every m steps, it chooses the oldest policy \u03c0k\u2212m+1 instead of the newest one \u03c0k+1. Also \u03c0\u2032k,m is related to \u03c0k,m as follows: \u03c0\u2032k,m takes the first action according to \u03c0k\u2212m+1 and then runs \u03c0k,m; equivalently, since \u03c0k,m loops over \u03c0k\u03c0k\u22121 . . . \u03c0k\u2212m+1, \u03c0\u2032k,m = \u03c0k\u2212m+1\u03c0k,m can be seen as a 1-step right rotation of \u03c0k,m. When there is no error (when \u03b5 = 0), this shows that the new policy \u03c0k+1,m is better than a \u201crotation\u201d of \u03c0k,m. When m = 1, \u03c0k+1,m = \u03c0k+1 and \u03c0\u2032k,m = \u03c0k and we thus recover the well-known (approximate) policy improvement theorem for standard API (see for instance [4, Lemma 6.1]).\n\nProof of Lemma 2. Since \u03c0\u2032k,m takes the first action with respect to \u03c0k\u2212m+1 and then runs \u03c0k,m, we have v\u03c0\u2032k,m = T\u03c0k\u2212m+1v\u03c0k,m. Now, since \u03c0k+1 \u2208 G(vk), we have T\u03c0k+1vk \u2265 T\u03c0k\u2212m+1vk and\n\nv\u03c0\u2032k,m \u2212 v\u03c0k+1,m = T\u03c0k\u2212m+1v\u03c0k,m \u2212 v\u03c0k+1,m\n= T\u03c0k\u2212m+1vk \u2212 \u03b3P\u03c0k\u2212m+1\u03b5k \u2212 v\u03c0k+1,m\n\u2264 T\u03c0k+1vk \u2212 \u03b3P\u03c0k\u2212m+1\u03b5k \u2212 v\u03c0k+1,m\n= T\u03c0k+1v\u03c0k,m + \u03b3(P\u03c0k+1 \u2212 P\u03c0k\u2212m+1)\u03b5k \u2212 v\u03c0k+1,m\n= T\u03c0k+1Tk,mv\u03c0k,m \u2212 Tk+1,mv\u03c0k+1,m + \u03b3(P\u03c0k+1 \u2212 P\u03c0k\u2212m+1)\u03b5k\n= Tk+1,mT\u03c0k\u2212m+1v\u03c0k,m \u2212 Tk+1,mv\u03c0k+1,m + \u03b3(P\u03c0k+1 \u2212 P\u03c0k\u2212m+1)\u03b5k\n= \u0393k+1,m(T\u03c0k\u2212m+1v\u03c0k,m \u2212 v\u03c0k+1,m) + \u03b3(P\u03c0k+1 \u2212 P\u03c0k\u2212m+1)\u03b5k\n= \u0393k+1,m(v\u03c0\u2032k,m \u2212 v\u03c0k+1,m) + \u03b3(P\u03c0k+1 \u2212 P\u03c0k\u2212m+1)\u03b5k,\n\nfrom which we deduce that:\n\nv\u03c0\u2032k,m \u2212 v\u03c0k+1,m \u2264 (I \u2212 \u0393k+1,m)^(\u22121) \u03b3(P\u03c0k+1 \u2212 P\u03c0k\u2212m+1)\u03b5k\n\nand the result follows by using the facts that \u2016\u03b5k\u2016\u221e \u2264 \u03b5 and $\\|(I - \\Gamma_{k+1,m})^{-1}\\|_\\infty = \\frac{1}{1-\\gamma^m}$.\n\nWe are now ready to prove the main result of this section.\n\nProof of Theorem 4. 
Using the facts that 1) Tk+1,m+1v\u03c0k,m = T\u03c0k+1Tk,mv\u03c0k,m = T\u03c0k+1v\u03c0k,m and 2) T\u03c0k+1vk \u2265 T\u03c0\u2217vk (since \u03c0k+1 \u2208 G(vk)), we have for k \u2265 m,\n\nv\u2217 \u2212 v\u03c0k+1,m\n= T\u03c0\u2217v\u2217 \u2212 Tk+1,mv\u03c0k+1,m\n= T\u03c0\u2217v\u2217 \u2212 T\u03c0\u2217v\u03c0k,m + T\u03c0\u2217v\u03c0k,m \u2212 Tk+1,m+1v\u03c0k,m + Tk+1,m+1v\u03c0k,m \u2212 Tk+1,mv\u03c0k+1,m\n= \u03b3P\u03c0\u2217(v\u2217 \u2212 v\u03c0k,m) + T\u03c0\u2217v\u03c0k,m \u2212 T\u03c0k+1v\u03c0k,m + \u0393k+1,m(T\u03c0k\u2212m+1v\u03c0k,m \u2212 v\u03c0k+1,m)\n\u2264 \u03b3P\u03c0\u2217(v\u2217 \u2212 v\u03c0k,m) + T\u03c0\u2217vk \u2212 T\u03c0k+1vk + \u03b3(P\u03c0k+1 \u2212 P\u03c0\u2217)\u03b5k + \u0393k+1,m(T\u03c0k\u2212m+1v\u03c0k,m \u2212 v\u03c0k+1,m)\n\u2264 \u03b3P\u03c0\u2217(v\u2217 \u2212 v\u03c0k,m) + \u03b3(P\u03c0k+1 \u2212 P\u03c0\u2217)\u03b5k + \u0393k+1,m(T\u03c0k\u2212m+1v\u03c0k,m \u2212 v\u03c0k+1,m). (6)\n\nConsider the policy \u03c0\u2032k,m defined in Lemma 2. Observing as in the beginning of the proof of Lemma 2 that T\u03c0k\u2212m+1v\u03c0k,m = v\u03c0\u2032k,m, Equation (6) can be rewritten as follows:\n\nv\u2217 \u2212 v\u03c0k+1,m \u2264 \u03b3P\u03c0\u2217(v\u2217 \u2212 v\u03c0k,m) + \u03b3(P\u03c0k+1 \u2212 P\u03c0\u2217)\u03b5k + \u0393k+1,m(v\u03c0\u2032k,m \u2212 v\u03c0k+1,m).\n\nBy using the facts that v\u2217 \u2265 v\u03c0k,m, v\u2217 \u2265 v\u03c0k+1,m and Lemma 2, we get\n\n$\\begin{aligned} \\|v_* - v_{\\pi_{k+1,m}}\\|_\\infty &\\le \\gamma \\|v_* - v_{\\pi_{k,m}}\\|_\\infty + 2\\gamma\\epsilon + \\frac{\\gamma^m (2\\gamma\\epsilon)}{1-\\gamma^m} \\\\ &= \\gamma \\|v_* - v_{\\pi_{k,m}}\\|_\\infty + \\frac{2\\gamma}{1-\\gamma^m}\\epsilon. \\end{aligned}$\n\nFinally, we obtain by induction that for all k \u2265 m,\n\n$\\|v_* - v_{\\pi_{k,m}}\\|_\\infty \\le \\gamma^{k-m} \\|v_* - v_{\\pi_{m,m}}\\|_\\infty + \\frac{2(\\gamma-\\gamma^{k+1-m})}{(1-\\gamma)(1-\\gamma^m)}\\epsilon.$\n\n6 Discussion, conclusion and future work\n\nWe recalled in Theorem 1 the standard performance bound when computing an approximately optimal stationary policy with the standard AVI and API algorithms. After arguing that this bound is tight \u2013 in particular by providing an original argument for AVI \u2013 we proposed three new dynamic programming algorithms (one based on AVI and two on API) that output non-stationary policies for which the performance bound can be significantly reduced (by a factor $\\frac{1}{1-\\gamma}$).\n\nFrom a bibliographical point of view, it is the work of [14] that made us think that non-stationary policies may lead to better performance bounds. In that work, the author considers problems with a finite-horizon T for which one computes non-stationary policies with performance bounds in O(T\u03b5), and infinite-horizon problems for which one computes stationary policies with performance bounds in $O\\left(\\frac{\\epsilon}{(1-\\gamma)^2}\\right)$. Using the informal equivalence of the horizons $T \\simeq \\frac{1}{1-\\gamma}$ one sees that non-stationary policies look better than stationary policies. In [14], non-stationary policies are only computed in the context of finite-horizon (and thus non-stationary) problems; the fact that non-stationary policies can also be useful in an infinite-horizon stationary context is to our knowledge completely new.\n\nThe best performance improvements are obtained when our algorithms consider periodic non-stationary policies of which the period grows to infinity, and thus require an infinite memory, which may look like a practical limitation. However, in two of the proposed algorithms, a parameter m allows one to make a trade-off between the quality of approximation $\\frac{2\\gamma}{(1-\\gamma^m)(1-\\gamma)}\\epsilon$ and the amount of memory O(m) required. In practice, it is easy to see that by choosing $m = \\left\\lceil \\frac{1}{1-\\gamma} \\right\\rceil$, that is a memory that scales linearly with the horizon (and thus the difficulty) of the problem, one can get a performance bound of\u00b2 $\\frac{2\\gamma}{(1-e^{-1})(1-\\gamma)}\\epsilon \\le \\frac{3.164\\gamma}{1-\\gamma}\\epsilon$.\n\nWe conjecture that our asymptotic bound of $\\frac{2\\gamma}{1-\\gamma}\\epsilon$, and the non-asymptotic bounds of Theorems 2 and 4 are tight. The actual proof of this conjecture is left for future work. Important recent works of the literature involve studying performance bounds when the errors are controlled in Lp norms instead of max-norm [19, 20, 21, 1, 8, 18, 17], which is natural when supervised learning algorithms are used to approximate the evaluation steps of AVI and API. Since our proofs are based on componentwise bounds like those of the pioneering works on this topic [19, 20], we believe that the extension of our analysis to Lp norm analysis is straightforward. Last but not least, an important research direction that we plan to follow consists in revisiting the many implementations of AVI and API for building stationary policies (see the list in the introduction), turning them into algorithms that look for non-stationary policies and studying them precisely analytically as well as empirically.\n\n\u00b2 With this choice of m, we have $m \\ge \\frac{1}{\\log(1/\\gamma)}$ and thus $\\frac{2}{1-\\gamma^m} \\le \\frac{2}{1-e^{-1}} \\le 3.164$.\n\nReferences\n\n[1] A. Antos, Cs. Szepesv\u00e1ri, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89\u2013129, 2008.\n\n[2] M. Gheshlaghi Azar, V. G\u00f3mez, and H.J. Kappen. Dynamic Policy Programming with Function Approximation. In 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, Fort Lauderdale, FL, USA, 2011.\n\n[3] D.P. Bertsekas. Approximate policy iteration: a survey and some new methods. Journal of Control Theory and Applications, 9:310\u2013335, 2011.\n\n[4] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n\n[5] L. Busoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuska, and B. De Schutter. Least-squares methods for Policy Iteration. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State of the Art. Springer, 2011.\n\n[6] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6, 2005.\n\n[7] E. Even-dar. Planning in pomdps using multiplicity automata. In Uncertainty in Artificial Intelligence (UAI), pages 185\u2013192, 2005.\n\n[8] A.M. Farahmand, M. Ghavamzadeh, Cs. Szepesv\u00e1ri, and S. Mannor. Regularized policy iteration. 
Advances in Neural Information Processing Systems, 21:441\u2013448, 2009.\n\n[9] A.M. Farahmand, R. Munos, and Cs. Szepesv\u00e1ri. Error propagation for approximate policy and value iteration (extended version). In NIPS, December 2010.\n\n[10] V. Gabillon, A. Lazaric, M. Ghavamzadeh, and B. Scherrer. Classification-based Policy Iteration with a Critic. In International Conference on Machine Learning (ICML), pages 1049\u20131056, Seattle, USA, June 2011.\n\n[11] G.J. Gordon. Stable Function Approximation in Dynamic Programming. In ICML, pages 261\u2013268, 1995.\n\n[12] C. Guestrin, D. Koller, and R. Parr. Max-norm projections for factored MDPs. In International Joint Conference on Artificial Intelligence, volume 17-1, pages 673\u2013682, 2001.\n\n[13] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient Solution Algorithms for Factored MDPs. Journal of Artificial Intelligence Research (JAIR), 19:399\u2013468, 2003.\n\n[14] S.M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.\n\n[15] S.M. Kakade and J. Langford. Approximately Optimal Approximate Reinforcement Learning. In International Conference on Machine Learning (ICML), pages 267\u2013274, 2002.\n\n[16] M.G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 4:1107\u20131149, 2003.\n\n[17] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-Sample Analysis of Least-Squares Policy Iteration. To appear in Journal of Machine Learning Research (JMLR), 2011.\n\n[18] O.A. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh. Finite Sample Analysis of Bellman Residual Minimization. In Masashi Sugiyama and Qiang Yang, editors, Asian Conference on Machine Learning. JMLR: Workshop and Conference Proceedings, volume 13, pages 309\u2013324, 2010.\n\n[19] R. Munos. Error Bounds for Approximate Policy Iteration. 
In International Conference on Machine Learning (ICML), pages 560\u2013567, 2003.\n\n[20] R. Munos. Performance Bounds in Lp norm for Approximate Value Iteration. SIAM J. Control and Optimization, 2007.\n\n[21] R. Munos and Cs. Szepesv\u00e1ri. Finite time bounds for sampling based fitted value iteration. Journal of Machine Learning Research (JMLR), 9:815\u2013857, 2008.\n\n[22] M. Petrik and B. Scherrer. Biasing Approximate Dynamic Programming with a Lower Discount Factor. In Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS 2008), Vancouver, Canada, 2008.\n\n[23] J. Pineau, G.J. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, volume 18, pages 1025\u20131032, 2003.\n\n[24] M. Puterman. Markov Decision Processes. Wiley, New York, 1994.\n\n[25] S. Singh and R. Yee. An Upper Bound on the Loss from Approximate Optimal-Value Functions. Machine Learning, 16(3):227\u2013233, 1994.\n\n[26] C. Thiery and B. Scherrer. Least-Squares \u03bb Policy Iteration: Bias-Variance Trade-off in Control Problems. In International Conference on Machine Learning, Haifa, Israel, 2010.\n\n[27] J.N. Tsitsiklis and B. Van Roy. Feature-Based Methods for Large Scale Dynamic Programming. Machine Learning, 22(1-3):59\u201394, 1996.", "award": [], "sourceid": 908, "authors": [{"given_name": "Bruno", "family_name": "Scherrer", "institution": null}, {"given_name": "Boris", "family_name": "Lesner", "institution": null}]}