{"title": "The Fixed Points of Off-Policy TD", "book": "Advances in Neural Information Processing Systems", "page_first": 2169, "page_last": 2177, "abstract": "Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods.", "full_text": "The Fixed Points of Off-Policy TD

J. Zico Kolter
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139
kolter@csail.mit.edu

Abstract

Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling.
It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods.

1 Introduction

In temporal prediction tasks, Temporal Difference (TD) learning provides a method for learning long-term expected rewards (the "value function") using only trajectories from the system. The algorithm is ubiquitous in Reinforcement Learning, and there has been a great deal of work studying the convergence properties of the algorithm. It is well known that for a tabular value function representation, TD converges to the true value function [3, 4]. For linear function approximation with on-policy sampling (i.e., when the states are drawn from the stationary distribution of the policy we are trying to evaluate), the algorithm converges to a well-known fixed point that is guaranteed to be close to the optimal projection of the true value function [17].
When states are sampled off-policy, standard TD may diverge when using linear function approximation [1], and this has led in recent years to a number of modified TD algorithms that are guaranteed to converge even in the presence of off-policy sampling [16, 15, 9, 10].

Of equal importance, however, is the actual quality of the TD solution under off-policy sampling. Previous work, as well as an example we present in this paper, shows that in general little can be said about this question: the solution found by TD can be arbitrarily poor in the case of off-policy sampling, even when the true value function is well-approximated by a linear basis. Pursuing a slightly different approach, other recent work has looked at providing problem-dependent bounds, which use problem-specific matrices to obtain tighter bounds than previous approaches [19]; these bounds can apply to the off-policy setting, but depend on problem data, and will still fail to provide a reasonable bound in the cases mentioned above where the off-policy approximation is arbitrarily poor. Indeed, a long-standing open question in Reinforcement Learning is whether any a priori guarantees can be made about the solution quality for off-policy methods using function approximation.

In this paper we propose a novel approach that addresses this question: we present an algorithm that looks for a subset of off-policy sampling distributions where a certain relaxed contraction property holds; for distributions in this set, we show that it is indeed possible to obtain error bounds on the solution quality similar to those for the on-policy case. Furthermore, we show that this set of feasible off-policy sampling distributions is convex, representable via a linear matrix inequality (LMI), and we demonstrate how the set can be approximated and projected onto efficiently in the finite sample setting.
The resulting method, which we refer to as TD with distribution optimization (TD-DO), is thus able to guarantee a good approximation to the best possible projected value function, even for off-policy sampling. In simulations we show that the algorithm can improve significantly over standard off-policy TD.

2 Preliminaries and Background

A Markov chain is a tuple (S, P, R, γ), where S is a set of states, P : S × S → R+ is a transition probability function, R : S → R is a reward function, and γ ∈ [0, 1) is a discount factor. For simplicity of presentation we will assume the state space is countable, and so can be indexed by the set S = {1, . . . , n}, which allows us to use matrix rather than operator notation. The value function for a Markov chain, V : S → R, maps states to their long-term discounted sum of rewards, and is defined as V(s) = E[Σ_{t=0}^∞ γ^t R(s_t) | s_0 = s]. The value function may also be expressed via Bellman's equation (in vector form)

V = R + γPV,   (1)

where R, V ∈ R^n represent vectors of all rewards and values respectively, and P ∈ R^{n×n} is a matrix of transition probabilities P_ij = P(s′ = j | s = i).

In linear function approximation, the value function is approximated as a linear combination of some features describing the state: V(s) ≈ w^T φ(s), where w ∈ R^k is a vector of parameters, and φ : S → R^k is a function mapping states to k-dimensional feature vectors; or, again using vector notation, V ≈ Φw, where Φ ∈ R^{n×k} is a matrix of all feature vectors. The TD solution is a fixed point of the Bellman operator followed by a projection, i.e.,

Φw⋆_D = Π_D(R + γPΦw⋆_D),   (2)

where Π_D = Φ(Φ^T DΦ)^{-1}Φ^T D is a projection matrix weighted by the diagonal matrix D ∈ R^{n×n}. Rearranging terms gives the analytical solution

w⋆_D = (Φ^T D(Φ − γPΦ))^{-1} Φ^T D R.   (3)

Although we cannot expect to form this solution exactly when P is unknown and too large to represent, we can approximate the solution via stochastic iteration (leading to the original TD algorithm), or via the least-squares TD (LSTD) algorithm, which forms the matrices

ŵ_D = Â^{-1} b̂,   Â = (1/m) Σ_{i=1}^m φ(s(i)) (φ(s(i)) − γφ(s′(i)))^T,   b̂ = (1/m) Σ_{i=1}^m φ(s(i)) r(i),   (4)

given a sequence of states, rewards, and next states {s(i), r(i), s′(i)}_{i=1}^m where s(i) ∼ D. When D is not the stationary distribution of the Markov chain (i.e., we are employing off-policy sampling), then the original TD algorithm may diverge (LSTD will still be able to compute the TD fixed point in this case, but has a greater computational complexity of O(k²)). Thus, there has been a great deal of work on developing O(k) algorithms that are guaranteed to converge to the LSTD fixed point even in the case of off-policy sampling [16, 15].

We note that the above formulation avoids any explicit mention of a Markov Decision Process (MDP) or actual policies: rather, we just have tuples of the form {s, r, s′} where s is drawn from an arbitrary distribution but s′ still follows the "policy" we are trying to evaluate. This is a standard formulation for off-policy learning (see e.g. [16, Section 2]); briefly, the standard way to reach this setting from the typical notion of off-policy learning (acting according to one policy in an MDP, but evaluating another) is to act according to some original policy in an MDP, and then subsample only those actions that are immediately consistent with the policy of interest.
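The sampled quantities in (4) are easy to compute in practice. As a minimal sketch (our own illustration, not the paper's code), the snippet below accumulates Â and b̂ for a one-dimensional feature, where the matrix inverse in ŵ_D = Â⁻¹b̂ reduces to a scalar division; the example data reuses the paper's two-state chain with uniform transitions and uniform sampling.

```python
# Sketch of the LSTD estimate w_hat = A_hat^{-1} b_hat (Eq. 4) for a
# one-dimensional feature, so A_hat and b_hat are scalars.  The sample
# set below is a hypothetical illustration, not from the paper.

def lstd_scalar(samples, phi, gamma):
    """samples: list of (s, r, s_next) tuples; phi: scalar feature map."""
    A_hat, b_hat = 0.0, 0.0
    m = len(samples)
    for s, r, s_next in samples:
        A_hat += phi(s) * (phi(s) - gamma * phi(s_next)) / m
        b_hat += phi(s) * r / m
    return b_hat / A_hat  # the (scalar) TD fixed-point estimate

# Two-state chain with P = (1/2) 11^T, as in the paper's Example 1.
phi = lambda s: [1.0, 1.051][s]       # features for states 0 and 1
V = [1.0, 1.05]
gamma = 0.99
# R = (I - gamma P) V, and (PV)_i = (1 + 1.05)/2 = 1.025 for both states.
R = [V[s] - gamma * 1.025 for s in range(2)]
# "Samples": every (s, s') pair occurs equally often -> uniform D.
samples = [(s, R[s], t) for s in range(2) for t in range(2)]
w = lstd_scalar(samples, phi, gamma)  # close to 1, so Phi*w is close to V
```

Here the uniform distribution is also the stationary distribution of this chain, so the estimate lands near the good on-policy fixed point.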
We use the above notation as it avoids the need for any explicit notation for actions and still captures the off-policy setting completely.

2.1 Error bounds for the TD fixed point

Of course, in addition to the issue of convergence, there is the question as to whether we can say anything about the quality of the approximation at this fixed point. For the case of on-policy sampling, the answer here is an affirmative one, as formalized in the following theorem.

Figure 1: Counterexample for off-policy TD learning: (left) the Markov chain considered for the counterexample; (right) the error of the TD estimate for different off-policy distributions (plotted on a log scale), along with the error of the optimal approximation.

Theorem 1. (Tsitsiklis and Van Roy [17], Lemma 6) Let w⋆_D be the unique solution to Φw⋆_D = Π_D(R + γPΦw⋆_D), where D is the stationary distribution of P. Then

‖Φw⋆_D − V‖_D ≤ (1/(1 − γ)) ‖Π_D V − V‖_D.   (5)

Thus, for on-policy sampling with linear function approximation, not only does TD converge to its fixed point, but we can also bound the error of its approximation relative to ‖Π_D V − V‖_D, the lowest possible approximation error for the class of function approximators.¹ Since this theorem plays an integral role in the remainder of this paper, we want to briefly give the intuition of its proof.
A fundamental property of Markov chains [17, Lemma 1] is that the transition matrix P is non-expansive in the D norm when D is the stationary distribution:

‖Px‖_D ≤ ‖x‖_D, ∀x.   (6)

From this it can be shown that the Bellman operator is a γ-contraction in the D norm, and Theorem 1 follows. When D is not the stationary distribution of the Markov chain, then (6) need not hold, and it remains to be seen what, if anything, can be said a priori about the TD fixed point in this situation.

3 An off-policy counterexample

Here we present a simple counterexample which shows, for general off-policy sampling, that the TD fixed point can be an arbitrarily poor approximator of the value function, even if the chosen bases can represent the true value function with low error. The same intuition has been presented previously [11], though we here present a concrete numerical example for illustration.

Example 1. Consider the two-state Markov chain shown in Figure 1, with transition probability matrix P = (1/2)11^T, discount factor γ = 0.99, and value function V = [1 1.05]^T (with R = (I − γP)V). Then for any ε > 0 and C > 0, there exists an off-policy distribution D such that using bases Φ = [1 1.05 + ε]^T gives

‖Π_D V − V‖ ≤ ε, and ‖Φw⋆_D − V‖ ≥ C.   (7)

Proof. (Sketch) The fact that ‖Π_D V − V‖ ≤ ε is obvious from the choice of basis.
To show that the TD error can be unboundedly large, let D = diag(p, 1 − p); then, after some simplification, the TD solution is given analytically by

w⋆_D = (−2961 + 4141p − 2820ε + 2820pε) / (−2961 + 4141p − 45240ε + 84840pε − 40400ε² + 40400pε²),   (8)

which is infinite (1/w⋆_D = 0) when

p = (2961 + 45240ε + 40400ε²) / (4141 + 84840ε + 40400ε²).   (9)

Since this solution is in (0, 1) for all ε, by choosing p close to this value we can make w⋆_D arbitrarily large, which in turn makes the error of the TD estimate arbitrarily large.

¹The approximation factor can be sharpened to 1/√(1 − γ²) in some settings [18], though the analysis does not carry over to our off-policy case, so we present here the simpler version.

Figure 1 shows a plot of ‖Φw⋆ − V‖₂ for the example above with ε = 0.001, varying p from 0 to 1. For p ≈ 0.715 the error of the TD solution approaches infinity; the essential problem here is that when D is not the stationary distribution of P, A = Φ^T D(Φ − γPΦ) can become close to zero (or for the matrix case, one of its eigenvalues can become zero), and the TD value function estimate can grow unboundedly large.
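The blow-up in Example 1 is easy to reproduce numerically. The short check below (pure Python, our own illustration) evaluates the scalar A(p) = Φ^T D(Φ − γPΦ) and the resulting w⋆_D for the two-state chain, confirming that A changes sign as p varies and that the solution explodes just past the critical p of equation (9).

```python
# Numeric check of Example 1: the scalar A(p) = Phi^T D (Phi - gamma P Phi)
# crosses zero at the critical p of Eq. (9), so w*_D = b(p)/A(p) blows up.
gamma, eps = 0.99, 0.001
phi = [1.0, 1.05 + eps]                  # bases Phi = [1, 1.05 + eps]^T
V = [1.0, 1.05]
Pphi = (phi[0] + phi[1]) / 2             # P = (1/2) 11^T averages the bases
R = [V[i] - gamma * (V[0] + V[1]) / 2 for i in range(2)]   # R = (I - gamma P) V

def td_solution(p):
    d = [p, 1 - p]                       # D = diag(p, 1 - p)
    A = sum(d[i] * phi[i] * (phi[i] - gamma * Pphi) for i in range(2))
    b = sum(d[i] * phi[i] * R[i] for i in range(2))
    return A, b / A

# Critical sampling distribution from Eq. (9):
p_crit = (2961 + 45240 * eps + 40400 * eps**2) / \
         (4141 + 84840 * eps + 40400 * eps**2)

A_lo, w_lo = td_solution(0.5)            # stationary distribution: w near 1
A_hi, w_hi = td_solution(0.9)            # A has changed sign
_, w_bad = td_solution(p_crit + 1e-6)    # just past the singularity: |w| huge
```

At p = 0.5 (the stationary distribution of this doubly stochastic chain) the solution is accurate, while near p_crit ≈ 0.711 the estimate is off by orders of magnitude.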
Thus, we argue that simple convergence for an off-policy algorithm is not a sufficient criterion for a good learning system, since even for a convergent algorithm the quality of the actual solution could be arbitrarily poor.

4 A convex characterization of valid off-policy distributions

Although it may seem as though the above example would imply that very little could be said about the quality of the TD fixed point under off-policy sampling, in this section we show that by imposing additional constraints on the sampling distribution, we can find a convex family of distributions for which it is possible to make guarantees.

To motivate the approach, we again note that error bounds for the on-policy TD algorithm follow from the Markov chain property that ‖Px‖_D ≤ ‖x‖_D for all x when D is the stationary distribution. However, finding a D that satisfies this condition is no easier than computing the stationary distribution directly, and thus is not a feasible approach. Instead, we consider a relaxed contraction property: that the transition matrix P followed by a projection onto the bases will be non-expansive for any function already in the span of Φ. Formally, we want to consider distributions D for which

‖Π_D PΦw‖_D ≤ ‖Φw‖_D   (10)

for any w ∈ R^k.
This defines a convex set of distributions, since

‖Π_D PΦw‖²_D ≤ ‖Φw‖²_D
⇔ w^T Φ^T P^T DΦ (Φ^T DΦ)^{-1} Φ^T DPΦ w ≤ w^T Φ^T DΦ w   (11)
⇔ w^T (Φ^T P^T DΦ (Φ^T DΦ)^{-1} Φ^T DPΦ − Φ^T DΦ) w ≤ 0.

This holds for all w if and only if²

Φ^T P^T DΦ (Φ^T DΦ)^{-1} Φ^T DPΦ − Φ^T DΦ ⪯ 0,   (12)

which in turn holds if and only if³

F ≡ [ Φ^T DΦ , Φ^T DPΦ ; Φ^T P^T DΦ , Φ^T DΦ ] ⪰ 0.   (13)

This is a linear matrix inequality (LMI) in D, and thus describes a convex set. Although the distribution D is too high-dimensional to optimize directly, analogous to LSTD, the F matrix defined above is of a representable size (2k × 2k), and can be approximated from samples. We will return to this point in the subsequent section, and for now will continue to use the notation of the true distribution D for simplicity. The chief theoretical result of this section is that if we restrict our attention to off-policy distributions within this convex set, we can prove non-trivial bounds about the approximation error of the TD fixed point.

Theorem 2. Let w⋆ be the unique solution to Φw⋆ = Π_D(R + γPΦw⋆), where D is any distribution satisfying (13).
Further, let D_µ be the stationary distribution of P, and let D̄ ≡ D^{-1/2} D_µ^{1/2}. Then⁴

‖Φw⋆_D − V‖_D ≤ ((1 + γκ(D̄)) / (1 − γ)) ‖Π_D V − V‖_D.   (14)

The bound here is of a similar form to the previously stated bound for on-policy TD: it bounds the error of the TD solution relative to the error of the best possible approximation, except for the additional γκ(D̄) term, which measures how much the chosen distribution deviates from the stationary distribution. When D = D_µ, κ(D̄) = 1, so we recover the original bound up to a constant factor. Even though the bound does include this term that depends on the distance from the stationary distribution, no such bound is possible for D that do not satisfy the convex constraint (13), as illustrated by the previous counterexample.

²A ⪯ 0 (A ⪰ 0) denotes that A is negative (positive) semidefinite.
³Using the Schur complement property that [A B; B^T C] ⪰ 0 ⇔ B^T A^{-1} B − C ⪯ 0 [2, pg 650-651].
⁴κ(A) denotes the condition number of A, the ratio of the singular values κ(A) = σmax(A)/σmin(A).

Figure 2: Counterexample from Figure 1 shown with the set of all valid distributions for which F ⪰ 0. Restricting the solution to this region avoids the possibility of the high error solution.

Proof.
(of Theorem 2) By the triangle inequality and the definition of the TD fixed point,

‖Φw⋆_D − V‖_D ≤ ‖Φw⋆_D − Π_D V‖_D + ‖Π_D V − V‖_D
= ‖Π_D(R + γPΦw⋆_D) − Π_D(R + γPV)‖_D + ‖Π_D V − V‖_D
= γ‖Π_D PΦw⋆_D − Π_D PV‖_D + ‖Π_D V − V‖_D
≤ γ‖Π_D PΦw⋆_D − Π_D PΠ_D V‖_D + γ‖Π_D PΠ_D V − Π_D PV‖_D + ‖Π_D V − V‖_D.   (15)

Since Π_D V = Φw̄ for some w̄, we can use the definition of our contraction, ‖Π_D PΦw‖_D ≤ ‖Φw‖_D, to bound the first term as

‖Π_D PΦw⋆_D − Π_D PΠ_D V‖_D ≤ ‖Φw⋆_D − Π_D V‖_D ≤ ‖Φw⋆_D − V‖_D.   (16)

Similarly, the second term in (15) can be bounded as

‖Π_D PΠ_D V − Π_D PV‖_D ≤ ‖PΠ_D V − PV‖_D ≤ ‖P‖_D ‖Π_D V − V‖_D,   (17)

where ‖P‖_D denotes the matrix norm ‖A‖_D ≡ max_{‖x‖_D ≤ 1} ‖Ax‖_D. Substituting these bounds back into (15) gives

(1 − γ)‖Φw⋆_D − V‖_D ≤ (1 + γ‖P‖_D)‖Π_D V − V‖_D,   (18)

so all that remains is to show that ‖P‖_D ≤ κ(D̄).
To show this, first note that ‖P‖_{D_µ} = 1, since

max_{‖x‖_{D_µ} ≤ 1} ‖Px‖_{D_µ} ≤ max_{‖x‖_{D_µ} ≤ 1} ‖x‖_{D_µ} = 1,   (19)

and for any nonsingular D,

‖P‖_D = max_{‖x‖_D ≤ 1} ‖Px‖_D = max_{‖y‖₂ ≤ 1} √(y^T D^{-1/2} P^T D P D^{-1/2} y) = ‖D^{1/2} P D^{-1/2}‖₂.

Finally, since D_µ and D are both diagonal (and thus commute),

‖D^{1/2} P D^{-1/2}‖₂ = ‖D_µ^{-1/2} D^{1/2} · D_µ^{1/2} P D_µ^{-1/2} · D_µ^{1/2} D^{-1/2}‖₂
≤ ‖D_µ^{-1/2} D^{1/2}‖₂ ‖D_µ^{1/2} P D_µ^{-1/2}‖₂ ‖D^{-1/2} D_µ^{1/2}‖₂
= ‖D_µ^{-1/2} D^{1/2}‖₂ ‖D^{-1/2} D_µ^{1/2}‖₂ = κ(D̄).   (20)

The final form of the bound can be quite loose, of course, as many of the steps involved in the proof used substantial approximations and discarded problem-specific data (such as the actual ‖Π_D P‖_D term in favor of the generic κ(D̄) term, for instance). This is in contrast to the previously mentioned work of Yu and Bertsekas [19] that uses these and similar terms to obtain much tighter, but data-dependent, bounds. Indeed, applying a theorem from this work we can arrive at a slight improvement of the bound above [13], but the focus here is just on the general form and possibility of the bound.

Returning to the counterexample from the previous section, we can visualize the feasible region for which F ⪰ 0, shown as the shaded portion in Figure 2, and so constraining the solution to this feasible region avoids the possibility of the high error solution.
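For the two-state counterexample the feasible set (13) can be checked directly: with k = 1, F is the symmetric 2 × 2 matrix [[a, b], [b, a]] with a = Φ^T DΦ and b = Φ^T DPΦ, which is positive semidefinite exactly when a ≥ |b|. The sketch below (our own illustration) verifies that the stationary distribution p = 1/2 is feasible while the near-divergent distribution from Example 1 is not, consistent with the shaded region in Figure 2.

```python
# Feasibility check of the LMI (13) for the two-state counterexample.
# With k = 1 the matrix F = [[a, b], [b, a]], where a = Phi^T D Phi and
# b = Phi^T D P Phi; F is PSD exactly when a >= |b| (eigenvalues a +/- b).
gamma, eps = 0.99, 0.001
phi = [1.0, 1.05 + eps]
Pphi = (phi[0] + phi[1]) / 2            # P = (1/2) 11^T

def lmi_margin(p):
    d = [p, 1 - p]                      # D = diag(p, 1 - p)
    a = sum(d[i] * phi[i] * phi[i] for i in range(2))   # Phi^T D Phi
    b = sum(d[i] * phi[i] * Pphi for i in range(2))     # Phi^T D P Phi
    return a - abs(b)                   # >= 0  <=>  F is PSD

margin_stationary = lmi_margin(0.5)     # stationary distribution: feasible
margin_divergent = lmi_margin(0.7114)   # near the blow-up of Eq. (9): infeasible
```

As guaranteed by the theory, the stationary distribution always satisfies the constraint (with only a small margin in this nearly degenerate example), while the distribution that makes the TD solution blow up is excluded.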
Moreover, in this example the optimal TD error occurs exactly at the point where λmin(F) = 0, so that projecting an off-policy distribution onto this set will give an optimal solution for initially infeasible distributions.

4.1 Estimation from samples

Returning to the issue of optimizing this distribution using only samples from the system, we note that, analogous to LSTD, for samples {s(i), r(i), s′(i)}_{i=1}^m,

F̂ = (1/m) Σ_{i=1}^m [ φ(s(i))φ(s(i))^T , φ(s(i))φ(s′(i))^T ; φ(s′(i))φ(s(i))^T , φ(s(i))φ(s(i))^T ] ≡ (1/m) Σ_{i=1}^m F̂_i   (21)

will be an unbiased estimate of the LMI matrix F (for a diagonal matrix D given by our sampling distribution over s(i)). Placing a weight d_i on each sample, we could optimize the sum F̂(d) = Σ_{i=1}^m d_i F̂_i and obtain a tractable optimization problem. However, optimizing these weights freely is not permissible, since this procedure allows us to choose d_i ≠ d_j even if s(i) = s(j), which violates the weights in the original LMI. If we additionally require that s(i) = s(j) ⇒ d_i = d_j (or, more appropriately for continuous features and states, for example that |d_i − d_j| → 0 as ‖φ(s(i)) − φ(s(j))‖ → 0 according to some norm), then we are free to optimize over these empirical distribution weights. In practice, we want to constrain this distribution in a manner commensurate with the complexity of the feature space and the number of samples. However, determining the best such distributions to use in practice remains an open problem for future work in this area.

Finally, since the empirical distribution need not satisfy F̂(d) ⪰ 0, we propose to "project" the empirical distribution onto this set by minimizing the KL divergence between the observed and optimized distributions, subject to the constraint that F̂(d) ⪰ 0.
Since this constraint is guaranteed to hold at the stationary distribution, the intuition here is that by moving closer to this set, we will likely obtain a better solution. Formally, the final optimization problem, which we refer to as the TD-DO method (Temporal Difference Distribution Optimization), is given by

min_d Σ_{i=1}^m −p̂_i log d_i   s.t.  1^T d = 1,  F̂(d) ⪰ 0,  d ∈ C,   (22)

where C is some convex set that respects the metric constraints described above. This is a convex optimization problem in d, and thus can be solved efficiently, though off-the-shelf solvers can perform quite poorly, especially for large dimension m.

4.2 Efficient Optimization

Here we present a first-order optimization method based on solving the dual of (22). By properly exploiting the decomposability of the objective and the low-rank structure of the dual problem, we develop an iterative optimization method where each gradient step can be computed very efficiently. The presentation here is necessarily brief due to space constraints, but we also include a longer description and an implementation of the method in the supplementary material. For simplicity we present the algorithm ignoring the constraint set C, though we discuss possible additional constraints briefly in the supplementary material.

We begin by forming the Lagrangian of (22), introducing Lagrange multipliers Z ∈ R^{2k×2k} for the constraint F̂(d) ⪰ 0 and ν ∈ R for the constraint 1^T d = 1. This leads to the dual optimization problem

max_{Z ⪰ 0, ν} min_d { −Σ_{i=1}^m p̂_i log d_i − tr(Z^T F̂(d)) + ν(1^T d − 1) }.   (23)

Treating Z as fixed, we maximize over ν and minimize over d in (23) using an equality-constrained, feasible start Newton method [2, pg 528].
Since the objective is separable over the d_i's, the Hessian matrix is diagonal, and the Newton step can be computed in O(m) time; furthermore, since we solve this subproblem for each update of the dual variables Z, we can warm-start Newton's method from previous solutions, leading to a number of Newton steps that is virtually constant in practice. Considering now the maximization over Z, the gradient of

g(Z) ≡ { Σ_i −p̂_i log d⋆_i(Z) − tr(Z^T F̂(d⋆(Z))) + ν⋆(Z)(1^T d⋆(Z) − 1) }   (24)

is given simply by ∇_Z g(Z) = −F̂(d⋆(Z)). We then exploit the fact that we expect Z to typically be low-rank: by the KKT conditions for a semidefinite program, F̂(d) and Z will have complementary ranks, and since we expect F̂(d) to be nearly full rank at the solution, we factor Z = Y Y^T for Y ∈ R^{2k×p} with p ≪ 2k. Although this is now a non-convex problem, local optimization of this objective is still guaranteed to give a global solution to the original semidefinite problem, provided we choose the rank of Y to be sufficient to represent the optimal solution [5]. The gradient of this transformed problem is ∇_Y g(Y Y^T) = −2F̂(d)Y, which can be computed in time O(mkp) since each F̂_i term is a low-rank matrix, and we optimize the dual objective via an off-the-shelf LBFGS solver [12, 14].

Figure 3: Average approximation error of the TD methods, using different numbers of basis functions, for the random Markov chain (left) and diffusion chain (right).

Figure 4: Average approximation error, using off-policy distributions closer or further from the stationary distribution (see text), for the random Markov chain (left) and diffusion chain (right).

Figure 5: Average approximation error for TD methods computed via sampling, for different numbers of samples, for the random Markov chain (left) and diffusion chain (right).
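In the two-state example the entire TD-DO projection (22) collapses to a one-dimensional problem, so it can be illustrated without the Lagrangian machinery above: minimize the KL objective −[p̂ log p + (1 − p̂) log(1 − p)] over the feasible p. The sketch below (our own drastic simplification using grid search, not the paper's dual/LBFGS method) projects an infeasible empirical distribution back into the set where F ⪰ 0.

```python
import math

# Toy TD-DO projection for the two-state example: minimize the KL
# objective of Eq. (22) over p subject to F >= 0, by grid search.
# (A drastic simplification of the paper's dual/LBFGS method.)
gamma, eps = 0.99, 0.001
phi = [1.0, 1.05 + eps]
Pphi = (phi[0] + phi[1]) / 2            # P = (1/2) 11^T

def feasible(p):                        # F PSD  <=>  a >= |b|  (k = 1)
    d = [p, 1 - p]
    a = sum(d[i] * phi[i] * phi[i] for i in range(2))
    b = sum(d[i] * phi[i] * Pphi for i in range(2))
    return a - abs(b) >= 0

def project(p_hat, grid=10000):
    """KL-project the empirical weight p_hat onto the feasible set."""
    best, best_obj = None, float('inf')
    for j in range(1, grid):
        p = j / grid
        if not feasible(p):
            continue
        obj = -(p_hat * math.log(p) + (1 - p_hat) * math.log(1 - p))
        if obj < best_obj:
            best, best_obj = p, obj
    return best

p_proj = project(0.72)   # 0.72 is infeasible (near the Example 1 blow-up)
```

The infeasible empirical weight is pulled back to the boundary of the feasible set, which in this example sits just past the stationary value p = 1/2; the resulting distribution then enjoys the guarantee of Theorem 2.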
Though it is difficult to bound p a priori, we can check after the solution that our chosen value was sufficient for the global solution, and we have found that very low values (p = 1 or p = 2) were sufficient in our experiments.

5 Experiments

Here we present simple simulation experiments illustrating our proposed approach; while the evaluation is of course small scale, the results highlight the potential of TD-DO to improve TD algorithms both practically as well as theoretically. Since the benefits of the method are clearest in terms of the mean performance over many different environments, we focus on randomly generated Markov chains of two types: a random chain and a diffusion process chain.⁵

Figure 6: (Left) Effect of the number of clusters for sample-based learning on the diffusion chain; (right) performance of the algorithm on the diffusion chain versus the number of LBFGS iterations.

Figure 3 shows the average approximation error of the different algorithms with differing numbers of basis functions, over 1000 domains. In this and all experiments other than those evaluating the effect of sampling, we use the full Φ and P matrices to compute the convex set, so that we are evaluating the performance of the approach in the limit of large numbers of samples.
We evaluate the approximation error ‖V̂ − V‖_D where D is the off-policy sampling distribution (so as to be as favorable as possible to off-policy TD). In all cases the TD-DO algorithm improves upon off-policy TD, though the degree of improvement can vary from minor to quite significant.

Figure 4 shows a similar result for varying the closeness of the sampling distribution to the stationary distribution; in our experiments, the off-policy distribution is sampled according to D ∼ Dir(1 + C_µ µ), where µ denotes the stationary distribution. As expected, the off-policy approaches perform similarly for larger C_µ (approaching the stationary distribution), with TD-DO having a clear advantage when the off-policy distribution is far from the stationary distribution.

In Figure 5 we consider the effect of sampling on the algorithms. For these experiments we employ a simple clustering method to compute a distribution over states d that respects the fact that φ(s(i)) = φ(s(j)) ⇒ d_i = d_j: we group the sampled states into k clusters via k-means clustering on the feature vectors, and optimize over the reduced distribution d ∈ R^k. In Figure 6 we vary the number of clusters k for the sampled diffusion chain, showing that the algorithm is robust to a large number of different distributional representations; we also show the performance of our method varying the number of LBFGS iterations, illustrating that performance generally improves monotonically.

6 Conclusion

The fundamental idea we have presented in this paper is that by considering a convex subset of off-policy distributions (and one which can be computed efficiently from samples), we can provide performance guarantees for the TD fixed point.
While we have focused on presenting error bounds for the analytical (infinite sample) TD fixed point, a huge swath of problems in TD learning arise from this same off-policy issue: the convergence of the original TD method, the ability to find the ℓ1 regularized TD fixed point [6], the on-policy requirement of the finite sample analysis of LSTD [8], and the convergence of TD-based policy iteration algorithms [7]. Although left for future work, we suspect that the same techniques we present here can also be extended to these other cases, potentially providing a wide range of analogous results that still apply under off-policy sampling.

Acknowledgements. We thank the reviewers for helpful comments and Bruno Scherrer for pointing out a potential improvement to the error bound. J. Zico Kolter is supported by an NSF CI Fellowship.

5 Experimental details: For the random Markov chain, rows of P are drawn IID from a Dirichlet distribution, and the reward and bases are random normal, with |S| = 11. For the diffusion-based chain, we sample |S| = 100 points from a 2D unit cube, x_i ∈ [0, 1]^2, and set p(s′ = j|s = i) ∝ exp(−‖x_i − x_j‖²/(2σ²)) for bandwidth σ = 0.4. Similarly, rewards are sampled from a zero-mean Gaussian process with covariance K_ij = exp(−‖x_i − x_j‖²/(2σ²)), and for basis vectors we use the principal eigenvectors of Cov(V) = E[(I − γP)^{−1}RR^T(I − γP)^{−T}] = (I − γP)^{−1}K(I − γP)^{−T}, which are the optimal bases for representing value functions (in expectation). Some details of the domains are omitted due to space constraints, but MATLAB code for all the experiments is included in the supplementary files.

References

[1] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the International Conference on Machine Learning, 1995.

[2] S. Boyd and L. Vandenberghe.
Convex Optimization. Cambridge University Press, 2004.

[3] P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3–4), 1992.

[4] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994.

[5] M. Journee, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

[6] J. Z. Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the International Conference on Machine Learning, 2009.

[7] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[8] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In Proceedings of the International Conference on Machine Learning, 2010.

[9] H. R. Maei and R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, 2010.

[10] H. R. Maei, Cs. Szepesvari, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the International Conference on Machine Learning, 2010.

[11] R. Munos. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, 2003.

[12] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

[13] B. Scherrer. Personal communication, 2011.

[14] M. Schmidt. minFunc, 2005. Available at http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html.

[15] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvari, and E.
Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the International Conference on Machine Learning, 2009.

[16] R. S. Sutton, Cs. Szepesvari, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems, 2008.

[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.

[18] J. N. Tsitsiklis and B. Van Roy. Average cost temporal difference learning. Automatica, 35(11):1799–1808, 1999.

[19] H. Yu and D. P. Bertsekas. Error bounds for approximations from projected linear equations. Mathematics of Operations Research, 35:306–329, 2010.