{"title": "Basis Construction from Power Series Expansions of Value Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1540, "page_last": 1548, "abstract": "This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as provides insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group inverse) of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. 
An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost.", "full_text": "Basis Construction from Power Series Expansions of\n\nValue Functions\n\nSridhar Mahadevan\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\nmahadeva@cs.umass.edu\n\nBo Liu\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\nboliu@cs.umass.edu\n\nAbstract\n\nThis paper explores links between basis construction methods in Markov deci-\nsion processes and power series expansions of value functions. This perspective\nprovides a useful framework to analyze properties of existing bases, as well as\nprovides insight into constructing more effective bases. Krylov and Bellman er-\nror bases are based on the Neumann series expansion. These bases incur very\nlarge initial Bellman errors, and can converge rather slowly as the discount fac-\ntor approaches unity. The Laurent series expansion, which relates discounted and\naverage-reward formulations, provides both an explanation for this slow conver-\ngence as well as suggests a way to construct more ef\ufb01cient basis representations.\nThe \ufb01rst two terms in the Laurent series represent the scaled average-reward and\nthe average-adjusted sum of rewards, and subsequent terms expand the discounted\nvalue function using powers of a generalized inverse called the Drazin (or group\ninverse) of a singular matrix derived from the transition matrix. Experiments\nshow that Drazin bases converge considerably more quickly than several other\nbases, particularly for large values of the discount factor. 
An incremental vari-\nant of Drazin bases called Bellman average-reward bases (BARBs) is described,\nwhich provides some of the same bene\ufb01ts at lower computational cost.\n\n1\n\nIntroduction\n\nMarkov decision processes (MDPs) are a well-studied model of sequential decision-making under\nuncertainty [11]. Recently, there has been growing interest in automatic basis construction meth-\nods for constructing a problem-speci\ufb01c low-dimensional representation of an MDP. Functions on\nthe original state space, such as the reward function or the value function, are \u201ccompressed\u201d by\nprojecting them onto a basis matrix \u03a6, whose column space spans a low-dimensional subspace of\nthe function space on the states of the original MDP. Among the various approaches proposed are\nreward-sensitive bases, such as Krylov bases [10] and an incremental variant called Bellman error\nbasis functions (BEBFs) [9]. These approaches construct bases by dilating the (sampled) reward\nfunction by geometric powers of the (sampled) transition matrix of a policy. An alternative ap-\nproach, called proto-value functions, constructs reward-invariant bases by \ufb01nding the eigenvectors\nof the symmetric graph Laplacian matrix induced by the neighborhood topology of the state space\nunder the given actions [7].\nA fundamental dilemma that is revealed by these prior studies is that neither reward-sensitive nor\nreward-invariant eigenvector bases by themselves appear to be fully satisfactory. A Chebyshev poly-\nnomial bound for the error due to approximation using Krylov bases was derived in [10], extending\na known similar result for general Krylov approximation [12]. This bound shows that performance\nof Krylov bases (and BEBFs) tends to degrade as the discount factor \u03b3 \u2192 1. 
Intuitively, the initial basis vectors capture short-term transient behavior near rewarding regions, and tend to poorly approximate the value function over the entire state space until a sufficiently large time scale is reached. A straightforward geometrical analysis of approximation errors using least-squares fixed-point approximation onto a basis shows that the Bellman error decomposes into the sum of two terms: a reward error and a second term involving the feature prediction error [1, 8] (see Figure 1). This analysis helps reveal the sources of error: Krylov bases and BEBFs tend to have low reward error (or zero in the non-sampled case), and hence a large component of the error in using these bases tends to be due to the feature prediction error. In contrast, PVFs tend to have large reward error since typical spiky goal reward functions are poorly approximated by smooth low-order eigenvectors; however, their feature prediction error can be quite low, as the eigenvectors often capture invariant subspaces of the model transition matrix.

A hybrid approach that combined low-order eigenvectors of the transition matrix (or PVFs) with higher-order Krylov bases was proposed in [10], and was empirically shown to perform better. This paper demonstrates a more principled approach to this problem, constructing new bases that emerge from investigating the links between basis construction methods and different power series expansions of value functions. In particular, instead of using the eigenvectors of the transition matrix, the proposed approach uses the average reward, or gain, as the first basis vector, and dilates the reward function by powers of the average-adjusted transition matrix. It turns out that the gain is an element of the eigenspace associated with the eigenvalue λ = 1 of the transition matrix. 
The relevance of power series expansions to approximations of value functions was hinted at in early work by Schwartz [13] on undiscounted optimization, although he did not discuss basis construction.

Krylov and Bellman error basis functions (BEBFs) [10, 9, 12], as well as proto-value functions [7], can be related to terms in the Neumann series expansion. Ultimately, the performance of these bases is limited by the speed of convergence of the Neumann expansion and, of course, by other errors arising from the reward and feature prediction errors. The key insight underlying this paper is to exploit connections between average-reward and discounted formulations. It is well known that discounted value functions can be written in the form of a Laurent series expansion, where the first two terms correspond to the average-reward term (scaled by 1/(1−γ)) and the average-adjusted sum of rewards (or bias). Higher-order terms involve powers of the Drazin (or group) inverse of a singular matrix related to the transition matrix. This expansion provides a mathematical framework for analyzing the properties of basis construction methods and developing newer representations. In particular, Krylov bases converge slowly for high discount factors since the value function is dominated by the scaled average-reward term, which is poorly approximated by the initial BEBF or Krylov basis vectors as it involves the long-term limiting matrix P∗. The Laurent series expansion leads to a new type of basis called a Drazin basis [6]. 
An approximation of Drazin bases called Bellman average-reward bases (BARBs) is described and compared with BEBFs, Krylov bases, and PVFs.

2 MDPs and Their Approximation

A Markov decision process M is formally defined by the tuple (S, A, P, R), where S is a discrete state space, A is the set of actions (which could be conditioned on the state s, so that As is the set of legal actions in s), P(s′|s, a) is the transition matrix specifying the effect of executing action a in state s, and R(s, a) : S × A → ℝ is the (expected) reward for doing action a in state s. The value function V associated with a deterministic policy π : S → A is defined as the long-term expected sum of rewards received starting from a state and following the policy π indefinitely.1 The value function V associated with a fixed policy π can be determined by solving the Bellman equation

V = T(V) = R + γPV,

where T(·) is the Bellman backup operator, R(s) = R(s, π(s)), P(s, s′) = P(s′|s, π(s)), and the discount factor satisfies 0 ≤ γ < 1. For a fixed policy π, the induced discounted Markov reward process is defined as (P, R, γ).

A popular approach to approximating V is to use a linear combination of basis functions, V ≈ V̂ = Φw, where the basis matrix Φ is of size |S| × k, and k ≪ |S|. The Bellman error for a given basis Φ, denoted BE(Φ), is defined as the difference between the two sides of the Bellman equation when V is approximated by Φw.

1In what follows, we suppress the dependence of P, R, and V on the policy π to avoid clutter.

Figure 1: The Bellman error due to a basis and its decomposition. See text for explanation.

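As a concrete illustration of this setup, the following sketch (our own illustrative code, not from the paper) builds a small chain Markov reward process and solves the Bellman equation directly; the chain size, reward placement, and discount factor are arbitrary choices.

```python
import numpy as np

# Illustrative sketch (ours, not the paper's code): exact policy evaluation
# for a fixed policy on a small chain Markov reward process (P, R, gamma).

def chain_transition(n):
    """Random-walk transition matrix on an n-state chain; the endpoint
    self-loops make the chain aperiodic."""
    P = np.zeros((n, n))
    for s in range(n):
        P[s, max(s - 1, 0)] += 0.5
        P[s, min(s + 1, n - 1)] += 0.5
    return P

def evaluate(P, R, gamma):
    """Solve the Bellman equation V = R + gamma * P @ V directly."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

P = chain_transition(5)
R = np.zeros(5)
R[2] = 1.0                                # a single reward in the middle state
V = evaluate(P, R, gamma=0.9)
assert np.allclose(V, R + 0.9 * P @ V)    # V is a Bellman fixed point
```

The direct solve is only feasible for small explicit models; the basis-construction methods discussed next aim to approximate V in far fewer dimensions.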
As Figure 1 illustrates, simple geometrical considerations can be used to show that the Bellman error can be decomposed into two terms: a reward error term and a weighted feature error term [1, 8]:2

BE(Φ) = R + γPΦwΦ − ΦwΦ = (I − ΠΦ)R + (I − ΠΦ)γPΦwΦ,

where ΠΦ is the weighted orthogonal projector onto the column space spanned by the basis Φ, and wΦ is the weight vector associated with the fixed point ΦwΦ = ΠΦ(T(ΦwΦ)). If the Markov chain defined by P is irreducible and aperiodic, then ΠΦ = Φ(ΦᵀD∗Φ)⁻¹ΦᵀD∗, where D∗ is a diagonal matrix whose entries contain the stationary distribution of the Markov chain. In the experiments shown below, for simplicity we use the unweighted projection ΠΦ = Φ(ΦᵀΦ)⁻¹Φᵀ, in which case the fixed point is given by wΦ = (ΦᵀΦ − γΦᵀPΦ)⁻¹ΦᵀR.

3 Neumann Expansions and Krylov/BEBF Bases

The most familiar expansion of the value function V is in terms of the Neumann series, where

V = (I − γP)⁻¹R = (I + γP + γ²P² + ...)R.

Krylov bases correspond to successive terms in the Neumann series [10, 12]. The jth Krylov subspace Kj is defined as the space spanned by the vectors Kj = {R, PR, P²R, ..., P^{j−1}R}. Note that K1 ⊆ K2 ⊆ ..., so that for some m, Km = Km+1 = K (where m is the degree of the minimal polynomial of A = I − γP). Thus, K is the P-invariant Krylov space generated by P and R. An incremental variant of the Krylov-based approach is called Bellman error basis functions (BEBFs) [9]. In particular, given a set of basis functions Φk (where the first one is assumed to equal R), the next basis is defined to be φk+1 = T(ΦkwΦk) − ΦkwΦk. In the model-free reinforcement learning setting, φk+1 can be approximated by the temporal-difference (TD) error φk+1 = r + γQ̂k(s′, πk(s′)) − Q̂k(s, a), given a set of stored transitions of the form (s, a, r, s′). Here, Q̂k is the fixed-point least-squares approximation to the action-value function Q(s, a) on the basis Φk. It can easily be shown that BEBFs and Krylov bases define the same space [8].

3.1 Convergence Analysis

A key issue in evaluating the effectiveness of Krylov bases (and BEBFs) is the speed of convergence of the Neumann series. As γ → 1, Krylov bases and BEBFs converge rather slowly, owing to a large increase in the weighted feature error. In practice, this problem can be shown to be acute even for values of γ = 0.9 or γ = 0.99, which are quite common in experiments. Petrik [10] derived a bound for the error due to Krylov approximation, which depends on the condition number of I − γP and the ratio of two Chebyshev polynomials on the complex plane. The condition number of I − γP can significantly increase as γ → 1 (see Figure 2).

It has been shown that BEBFs reduce the Bellman error at a rate bounded by value iteration [9]. Iterative solution methods for solving linear systems Ax = b can broadly be categorized as different ways of decomposing A = S − T, giving rise to the iteration Sxk+1 = Txk + b. 
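Concretely, the Krylov/BEBF construction and the unweighted least-squares fixed point above can be sketched as follows (an illustrative implementation with our own helper names; the chain, reward placement, and γ are arbitrary choices):

```python
import numpy as np

# Sketch (ours) of Krylov basis construction and the unweighted
# least-squares fixed point w_Phi = (Phi^T Phi - gamma Phi^T P Phi)^-1 Phi^T R.

def krylov_basis(P, R, m):
    """Columns R, PR, ..., P^{m-1}R, orthonormalized via QR for stability."""
    cols = [R]
    for _ in range(m - 1):
        cols.append(P @ cols[-1])
    Q, _ = np.linalg.qr(np.column_stack(cols))
    return Q

def fixed_point_weights(Phi, P, R, gamma):
    A = Phi.T @ Phi - gamma * Phi.T @ P @ Phi
    return np.linalg.solve(A, Phi.T @ R)

def bellman_error(Phi, P, R, gamma):
    v = Phi @ fixed_point_weights(Phi, P, R, gamma)
    return np.linalg.norm(R + gamma * P @ v - v)

# Example: reflecting random-walk chain with a single reward.
n, gamma = 10, 0.9
P = np.diag(np.full(n - 1, 0.5), 1) + np.diag(np.full(n - 1, 0.5), -1)
P[0, 0] += 0.5; P[-1, -1] += 0.5
R = np.zeros(n); R[4] = 1.0
errs = [bellman_error(krylov_basis(P, R, m), P, R, gamma) for m in (1, 3, n)]
assert errs[0] > errs[-1] and errs[-1] < 1e-8   # full Krylov space is exact
```

With m = n basis vectors the orthonormalized Krylov matrix spans the whole space, so the fixed point recovers V exactly; the interesting regime, analyzed next, is how fast the error falls for m ≪ n as γ → 1.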
The convergence of this iteration depends on the spectral structure of B = S⁻¹T, in particular on its largest eigenvalue. For standard value iteration, A = I − γP, and consequently the natural decomposition is to set S = I and T = γP. Thus, the largest eigenvalue of S⁻¹T is γ, and as γ → 1, convergence of value iteration is progressively decelerated. For Krylov bases, the following bound can be shown.

2The two components of the Bellman error may partially (or fully) cancel each other out: the Bellman error of V itself is 0, but it generates non-zero reward and feature prediction errors.

Figure 2: Condition number of I − γP as γ → 1, where P is the transition matrix of the optimal policy in a chain MDP of 50 states with rewards in states 10 and 41 [9].

Theorem 1: The Bellman error in approximating the value function for a discounted Markov reward process (P, R, γ) using m BEBF or Krylov bases is bounded by

‖BE(Φ)‖₂ = ‖(I − ΠΦ)γPΦwΦ‖₂ ≤ κ(X) Cm(a/d) / Cm(c/d),

where I − γP = XSX⁻¹ is the (Jordan form) diagonalization of I − γP, and κ(X) = ‖X‖₂‖X⁻¹‖₂ is the condition number of I − γP. Cm is the Chebyshev polynomial of degree m of the first kind, and a, c, and d are chosen such that E(a, c, d) is an ellipse in the complex plane that covers all the eigenvalues of I − γP, with center c, focal distance d, and major semi-axis a.

Proof: This result follows directly from standard Krylov space approximation results [12] and past results on approximation using BEBFs and Krylov spaces for MDPs [10, 8]. First, note that the overall Bellman error reduces to the weighted feature error, since the reward error is 0 (R is in the span of both BEBFs and Krylov bases):

‖BE(Φ)‖₂ = ‖T(ΦwΦ) − ΦwΦ‖₂ = ‖R − (I − γP)V̂‖₂ = ‖(I − ΠΦ)γPΦwΦ‖₂.

Next, setting A = I − γP, we have

‖R − Aw‖₂ = ‖R − Σ_{i=1}^m wᵢAⁱR‖₂ = ‖Σ_{i=0}^m −w(i)AⁱR‖₂,

assuming w(0) = −1. A standard result in Krylov space approximation [12] shows that

min_{p∈Pm} ‖p(A)‖₂ ≤ min_{p∈Pm} κ(X) max_{i=1,...,n} |p(λᵢ)| ≤ κ(X) Cm(a/d) / Cm(c/d),

where Pm is the set of polynomials of degree m.

Figure 2 shows empirically that one reason for the slow convergence of BEBFs and Krylov bases is that as γ → 1, the condition number of I − γP significantly increases. Figure 3 compares the weighted feature error of BEBF bases (the performance of Krylov bases is identical and not shown) on a 50-state chain domain with a single goal reward of 1 in state 25. The dynamics of the chain are identical to those in [9]. Notice that as γ increases, the feature error increases dramatically.

4 Laurent Series Expansion and Drazin Bases

A potential solution to the slow convergence of BEBFs and Krylov bases is suggested by a different power series called the Laurent expansion. It is well known from the classical theory of MDPs that the discounted value function can be written in a form that relates it to the average-reward formulation [11]. 
This connection uses the following Laurent series expansion of V in terms of the average reward ρ of the policy π, the average-adjusted sum of rewards h, and higher-order terms that involve the generalized spectral inverse (Drazin or group inverse) of I − P:

V = (1/γ) [ (γ/(1−γ)) ρ + h + Σ_{n=1}^∞ ((γ−1)/γ)ⁿ ((I − P)ᴰ)ⁿ⁺¹ R ].   (1)

Figure 3: Weighted feature error of BEBF bases on a 50-state chain MDP with a single reward of 1 in state 25. Top: optimal value function for γ = 0.5, 0.9, 0.99. Bottom: weighted feature error. As γ increases, the weighted feature error grows much larger as the value function becomes progressively smoother and less like the reward function. Note the difference in scale between the three plots.

As γ → 1, the first term in the Laurent series expansion grows quite large, causing the slow convergence of Krylov bases and BEBFs. (I − P)ᴰ is a generalized inverse of the singular matrix I − P called the Drazin inverse [2, 11]. For any square matrix A ∈ Cⁿˣⁿ, the Drazin inverse X of A satisfies the following properties: (1) XAX = X; (2) XA = AX; (3) Aᵏ⁺¹X = Aᵏ. Here, k is the index of matrix A, which is the smallest nonnegative integer k such that R(Aᵏ) = R(Aᵏ⁺¹), where R(A) is the range (or column space) of matrix A. For example, a nonsingular (square) matrix A has index 0, because R(A⁰) = R(I) = R(A). The matrix I − P of a Markov chain has index k = 1. For index 1 matrices, the Drazin inverse coincides with the group inverse, which is defined as

(I − P)ᴰ = (I − P + P∗)⁻¹ − P∗,

where the long-term limiting matrix P∗ = lim_{n→∞} (1/n) Σ_{k=0}^n Pᵏ = I − (I − P)(I − P)ᴰ. The matrix (I − P + P∗)⁻¹ is often referred to as the fundamental matrix of a Markov chain. Note that for index 1 matrices, the Drazin (or group) inverse satisfies the additional property AXA = A. Also, P∗ and I − P∗ are orthogonal projection matrices, since they are both idempotent and furthermore PP∗ = P∗P = P∗, and P∗(I − P∗) = 0.3 The gain and bias can be expressed in terms of P∗. In particular, the gain is g = P∗R, and the bias, or average-adjusted value function, is given by

h = (I − P)ᴰR = [(I − P + P∗)⁻¹ − P∗] R = Σ_{t=0}^∞ (Pᵗ − P∗)R,

where the last equality holds for aperiodic Markov chains. If we represent the coefficients in the Laurent series as y₋₁, y₀, ..., they can be shown to be solutions to the following set of equations (for n = 1, 2, ...); in terms of the expansion above, y₋₁ is the gain of the policy, y₀ is its bias, and so on:

(I − P)y₋₁ = 0,   y₋₁ + (I − P)y₀ = R,   ...,   yₙ₋₁ + (I − P)yₙ = 0.

Analogous to the Krylov bases, the successive terms of the Laurent series expansion can be viewed as basis vectors. More formally, the Drazin basis is defined as the space spanned by the vectors [6]:

Dm = {P∗R, (I − P)ᴰR, ((I − P)ᴰ)²R, ..., ((I − P)ᴰ)ᵐ⁻¹R}.   (2)

The first basis vector is the average reward, or gain, g = P∗R of policy π. The second basis vector is the bias, or average-adjusted sum of rewards, h. Subsequent basis vectors correspond to higher-order terms in the Laurent series.

3Several methods are available to compute Drazin inverses, as described in [2]. 
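For a small explicit chain, these quantities are straightforward to compute. The sketch below (our construction, using the group-inverse formula above; the helper names are ours) forms P∗ from the stationary distribution and stacks the Drazin basis vectors of Eq. (2):

```python
import numpy as np

def stationary(P):
    """Stationary distribution: solve pi P = pi subject to sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def group_inverse(P):
    """(I - P)^D = (I - P + P*)^-1 - P* for an irreducible chain (index 1)."""
    n = P.shape[0]
    Pstar = np.outer(np.ones(n), stationary(P))      # every row equals pi
    H = np.linalg.inv(np.eye(n) - P + Pstar) - Pstar
    return H, Pstar

def drazin_basis(P, R, m):
    """Columns P*R, (I-P)^D R, ((I-P)^D)^2 R, ... as in Eq. (2)."""
    H, Pstar = group_inverse(P)
    cols, v = [Pstar @ R], R.copy()
    for _ in range(m - 1):
        v = H @ v
        cols.append(v)
    return np.column_stack(cols)

# Example: reflecting random-walk chain (uniform stationary distribution).
n = 6
P = np.diag(np.full(n - 1, 0.5), 1) + np.diag(np.full(n - 1, 0.5), -1)
P[0, 0] += 0.5; P[-1, -1] += 0.5
H, Pstar = group_inverse(P)
I = np.eye(n)
assert np.allclose(H @ (I - P), (I - P) @ H)         # XA = AX
assert np.allclose(H @ (I - P) @ H, H)               # XAX = X
assert np.allclose((I - P) @ H @ (I - P), I - P)     # AXA = A (index 1)
```

The three assertions check exactly the group-inverse properties stated above; the first Drazin basis vector, P∗R, is a constant vector equal to the gain in every state.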
An iterative method called Successive Matrix Squaring (SMS) has also been developed for efficient parallel implementation [15].

5 Bellman Average-Reward Bases

To get further insight into methods for approximating Drazin bases, it is helpful to note that the (i, j)th element of the Drazin or group inverse matrix is the difference between the expected number of visits to state j starting in state i under the transition matrix P versus the expected number of visits to j under the long-term limiting matrix P∗. Building on this insight, an approximate Drazin basis called Bellman average-reward bases (BARBs) can be defined as follows. First, the approximate Drazin basis is defined as the space spanned by the vectors

Am = {P∗R, (P − P∗)R, (P − P∗)²R, ..., (P − P∗)ᵐ⁻¹R}
   = {P∗R, PR − P∗R, P²R − P∗R, ..., Pᵐ⁻¹R − P∗R}.

(The second equality uses (P − P∗)ᵏ = Pᵏ − P∗ for k ≥ 1, which follows from PP∗ = P∗P = P∗ and the idempotence of P∗.) BARBs are similar to Krylov bases, except that the reward function is dilated by the average-adjusted transition matrix P − P∗, and the first basis element is the gain. Defining ρ = P∗R, BARBs are constructed as follows:

φ1 = ρ = P∗R,    φk+1 = R − ρ + PΦkwΦk − ΦkwΦk.

The cost of computing BARBs is essentially that of computing BEBFs (or Krylov bases), except for the term involving the gain ρ. Analogous to BEBFs, in the model-free reinforcement learning setting, BARBs can be computed using the average-adjusted TD error

φk+1(s) = r − ρk(s) + Q̂k(s′, πk(s′)) − Q̂k(s, a).

There are a number of incremental algorithms for computing ρ (such as the scheme used in R-learning [13], or simply averaging the sample rewards). Several methods for computing P∗ are discussed in [14].

5.1 Expressivity Properties

Some results concerning the expressivity of approximate Drazin bases and BARBs are now discussed. Due to space, detailed proofs are not included.

Theorem 2: For any k > 1, the following hold:

span{Ak(R)} ⊆ span{BARBk+1(R)},
span{BARBk+1(R)} = span{{R} ∪ Ak(R)},
span{BEBFk(R)} ⊆ span{BARBk+1(R)},
span{BARBk+1(R)} = span{{ρ} ∪ BEBFk(R)}.

Proof: The proof follows by induction. For k = 1, both approximate Drazin bases and BARBs contain the gain ρ. For k = 2, BARB2(R) = R − ρ, whereas A2 = PR − ρ (which is included in BARB3(R)). For general k > 2, the new basis vector in Ak is Pᵏ⁻¹R, which can be shown to be part of BARBk+1(R). 
The other results can be derived through similar analysis.

There is a similar decomposition of the average-adjusted Bellman error into a component that depends on the average-adjusted reward error and an undiscounted weighted feature error.

Theorem 3: Given a basis Φ, for any average-reward Markov reward process (P, R), the Bellman error can be decomposed as follows:

T(V̂) − V̂ = R − ρ + PΦwΦ − ΦwΦ
          = (I − ΠΦ)(R − ρ) + (I − ΠΦ)PΦwΦ
          = (I − ΠΦ)R − (I − ΠΦ)ρ + (I − ΠΦ)PΦwΦ.

Proof: The three terms represent the reward error, the average-reward error, and the undiscounted weighted feature error. The result follows immediately from the geometry of the Bellman error, similar to that shown in Figure 1, and from the linearity of orthogonal projectors.

A more detailed convergence analysis of BARBs is given in [4], based on the relationship between the approximation error and the mixing rate of the Markov chain defined by P.

Figure 4: Experimental comparison on a 50-state chain MDP with rewards in states 10 and 41. Left column: γ = 0.7. Middle column: γ = 0.9. Right column: γ = 0.99.

6 Experimental Comparisons

Figure 4 compares the performance of Bellman average-reward basis functions (BARBs) vs. Bellman error basis functions (BEBFs) and a variant of proto-value functions (PVF-MP) on a 50-state chain MDP. This problem was previously studied in [3]. The two actions (go left or go right) succeed with probability 0.9; when an action fails, it results in movement in the opposite direction with probability 0.1. The two ends of the chain are treated as “dead ends”. Rewards of +1 are given in states 10 and 41. 
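The BARB recursion of Section 5 is easy to prototype on a chain of this kind. The following sketch (our illustrative code; it uses simplified reflecting random-walk dynamics as a stand-in for the chain above, the unweighted fixed point of Section 2, and a model-based P∗ in place of an incremental estimate of ρ) builds BARBs on a 50-state chain with rewards in states 10 and 41:

```python
import numpy as np

def barb_basis(P, R, Pstar, m, gamma):
    """phi_1 = P*R; phi_{k+1} = R - rho + P Phi_k w_k - Phi_k w_k,
    with w_k the unweighted least-squares fixed point on Phi_k."""
    rho = Pstar @ R
    cols = [rho]
    for _ in range(m - 1):
        Phi = np.linalg.qr(np.column_stack(cols))[0]     # orthonormalize
        w = np.linalg.solve(Phi.T @ Phi - gamma * Phi.T @ P @ Phi, Phi.T @ R)
        v = Phi @ w
        cols.append(R - rho + P @ v - v)                 # avg-adjusted residual
    return np.linalg.qr(np.column_stack(cols))[0]

# 50-state reflecting random-walk chain with rewards in states 10 and 41
# (1-indexed); this symmetric chain has a uniform stationary distribution,
# so P* is the constant matrix 1/n.
n = 50
P = np.diag(np.full(n - 1, 0.5), 1) + np.diag(np.full(n - 1, 0.5), -1)
P[0, 0] += 0.5; P[-1, -1] += 0.5
Pstar = np.full((n, n), 1.0 / n)
R = np.zeros(n); R[9] = R[40] = 1.0
Phi = barb_basis(P, R, Pstar, 8, gamma=0.9)
assert np.allclose(Phi[:, 0], Phi[0, 0])      # first basis vector is the gain
assert np.allclose(Phi.T @ Phi, np.eye(8))    # orthonormal columns
```

The first basis vector is the (constant) gain, and each subsequent vector is the average-adjusted Bellman residual of the current fixed-point approximation, mirroring the φk+1 recursion above.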
The PVF-MP algorithm selects basis functions incrementally based upon\nthe Bellman error, where basis function k+1 is the PVF that has the largest inner product with the\nBellman error resulting from the previous k basis functions.\nPVFs have a high reward error, since the reward function is a set of two delta functions that is poorly\napproximated by the eigenvectors of the combinatorial Laplacian on the chain graph. However,\nPVFs have very low weighted feature error. The overall Bellman error remains large due to the high\nreward error. The reward error for BEBFs is by de\ufb01nition 0 as R is a basis vector itself. However,\nthe weighted feature error for BEBFs grows quite large as \u03b3 increases from 0.7 to 0.99, particularly\ninitially, until around 15 bases are used. Consequently, the Bellman error for BEBFs remains large\ninitially. BARBs have the best overall performance at this task, particularly for \u03b3 = 0.9 and 0.99.\nThe plots in Figure 5 compare BARBs vs. Drazin and Krylov bases in the two-room gridworld\nMDP [7]. Drazin bases perform the best, followed by BARBs, and then Krylov bases. At higher\ndiscount factors, the differences are more noticeable. Finally, Figure 6 compares BARBs vs. BEBFs\non a 10 \u00d7 10 grid world MDP with a reward placed in the upper left corner state. The advantage\nof using BARBs over BEBFs is signi\ufb01cant as \u03b3 \u2192 1. The policy is a random walk on the grid.\nFinally, similar results were also obtained in experiments conducted on random MDPs, where the\nstates were decomposed into communicating classes of different block sizes (not shown).\n\n7 Conclusions and Future Work\n\nThe Neumann and Laurent series lead to different ways of constructing problem-speci\ufb01c bases.\nThe Neumann series, which underlies Bellman error and Krylov bases, tends to converge slowly as\n\u03b3 \u2192 1. 
Figure 5: Comparison of BARBs vs. Drazin and Krylov bases in a 100-state two-room MDP [7]. All bases were evaluated on the optimal policy. The reward was set at +100 for reaching a corner goal state in one of the rooms. Left: γ = 0.9. Right: γ = 0.99.

Figure 6: Experimental comparison of BARBs and BEBFs on a 10 × 10 grid world MDP with a reward in the upper left corner. Left: γ = 0.7. Middle: γ = 0.99. Right: γ = 0.999.

To address this shortcoming, the Laurent series was used to derive a new approach called the Drazin basis, which expands the discounted value function in terms of the average reward, the bias, and higher-order terms representing powers of the Drazin inverse of a singular matrix derived from the transition matrix. An incremental version of Drazin bases called Bellman average-reward bases (BARBs) was investigated. Numerical experiments on simple MDPs show superior performance of Drazin bases and BARBs over BEBFs, Krylov bases, and PVFs. Scaling BARBs and Drazin bases to large MDPs requires addressing sampling issues and exploiting structure in transition matrices, such as using factored representations. BARBs are computationally more tractable than Drazin bases and merit further study. 
Reinforcement learning methods to estimate the first few terms of the Laurent series were proposed in [5], and can be adapted for basis construction. The Schultz expansion provides a way of rewriting the Neumann series using a multiplicative series of dyadic powers of the transition matrix, which is useful for multiscale bases [6].

Acknowledgements

This research was supported in part by the National Science Foundation (NSF) under grants NSF IIS-0534999 and NSF IIS-0803288, and by the Air Force Office of Scientific Research (AFOSR) under grant FA9550-10-1-0383. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the AFOSR or the NSF.

References

[1] D. Bertsekas and D. Castañon. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34:589–598, 1989.

[2] S. Campbell and C. Meyer. Generalized Inverses of Linear Transformations. Pitman, 1979.

[3] M. Lagoudakis and R. Parr. Least-squares policy iteration. 
Journal of Machine Learning Research, 4:1107–1149, 2003.

[4] B. Liu and S. Mahadevan. An investigation of basis construction from power series expansions of value functions. Technical report, University of Massachusetts, Amherst, 2010.

[5] S. Mahadevan. Sensitive-discount optimality: Unifying discounted and average reward reinforcement learning. In Proceedings of the International Conference on Machine Learning, 1996.

[6] S. Mahadevan. Learning representation and control in Markov Decision Processes: New frontiers. Foundations and Trends in Machine Learning, 1(4):403–565, 2009.

[7] S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov Decision Processes. Journal of Machine Learning Research, 8:2169–2231, 2007.

[8] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2008.

[9] R. Parr, C. Painter-Wakefield, L. Li, and M. Littman. Analyzing feature generation for value function approximation. In Proceedings of the International Conference on Machine Learning (ICML), pages 737–744, 2007.

[10] M. Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2574–2579, 2007.

[11] M. L. Puterman. Markov Decision Processes. Wiley Interscience, New York, USA, 1994.

[12] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM Press, 2003.

[13] A. Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proc. 10th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.

[14] W. J. Stewart. 
Numerical methods for computing stationary distributions of finite irreducible Markov chains. In Advances in Computational Probability. Kluwer Academic Publishers, 1997.

[15] Y. Wei. Successive matrix squaring algorithm for computing the Drazin inverse. Applied Mathematics and Computation, 108:67–75, 2000.", "award": [], "sourceid": 892, "authors": [{"given_name": "Sridhar", "family_name": "Mahadevan", "institution": null}, {"given_name": "Bo", "family_name": "Liu", "institution": null}]}