{"title": "Predictive State Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 271, "page_last": 279, "abstract": "We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem.", "full_text": "Predictive State Temporal Difference Learning\n\nByron Boots\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nbeb@cs.cmu.edu\n\nGeoffrey J. 
Gordon\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nggordon@cs.cmu.edu\n\nAbstract\n\nWe propose a new approach to value function approximation which combines lin-\near temporal difference reinforcement learning with subspace identi\ufb01cation. In\npractical applications, reinforcement learning (RL) is complicated by the fact that\nstate is either high-dimensional or partially observable. Therefore, RL methods\nare designed to work with features of state rather than state itself, and the suc-\ncess or failure of learning is often determined by the suitability of the selected\nfeatures. By comparison, subspace identi\ufb01cation (SSID) methods are designed to\nselect a feature set which preserves as much information as possible about state.\nIn this paper we connect the two approaches, looking at the problem of reinforce-\nment learning with a large set of features, each of which may only be marginally\nuseful for value function approximation. We introduce a new algorithm for this\nsituation, called Predictive State Temporal Difference (PSTD) learning. As in\nSSID for predictive state representations, PSTD \ufb01nds a linear compression op-\nerator that projects a large set of features down to a small set that preserves the\nmaximum amount of predictive information. As in RL, PSTD then uses a Bellman\nrecursion to estimate a value function. We discuss the connection between PSTD\nand prior approaches in RL and SSID. We prove that PSTD is statistically consis-\ntent, perform several experiments that illustrate its properties, and demonstrate its\npotential on a dif\ufb01cult optimal stopping problem.\n\n1\n\nIntroduction\n\nWe wish to estimate the value function of a policy in an unknown decision process in a high dimen-\nsional and partially-observable environment. We represent the value function in a linear architec-\nture, as a linear combination of features of (sequences of) observations. 
A popular family of learning algorithms called temporal difference (TD) methods [1] is designed for this situation. In particular, least-squares TD (LSTD) algorithms [2, 3, 4] exploit the linearity of the value function to estimate its parameters from sampled trajectories, i.e., from sequences of feature vectors of visited states, by solving a set of linear equations.\nRecently, Parr et al. looked at the problem of value function estimation from the perspective of both model-free and model-based reinforcement learning [5]. The model-free approach (which includes TD methods) estimates a value function directly from sample trajectories. The model-based approach, by contrast, first learns a model of the process and then computes the value function from the learned model. Parr et al. demonstrated that these two approaches compute exactly the same value function [5]. In the current paper, we build on this insight, while simultaneously finding a compact set of features using powerful methods from system identification.\nFirst, we look at the problem of improving LSTD from a model-free predictive-bottleneck perspective: given a large set of features of history, we devise a new TD method called Predictive State Temporal Difference (PSTD) learning. PSTD estimates the value function through a bottleneck that preserves only predictive information (Section 3). Second, we look at the problem of value function estimation from a model-based perspective (Section 4). Instead of learning a linear transition model in feature space, as in [5], we use subspace identification [6, 7] to learn a PSR from our samples. Since PSRs are at least as compact as POMDPs, our representation can naturally be viewed as a value-directed compression of a much larger POMDP. Finally, we show that our two improved methods are equivalent. 
This result yields some appealing theoretical benefits: for example, PSTD features can be explicitly interpreted as a statistically consistent estimate of the true underlying system state. And, the feasibility of finding the true value function can be shown to depend on the linear dimension of the dynamical system, or equivalently, the dimensionality of the predictive state representation\u2014not on the cardinality of the POMDP state space. Therefore our representation is naturally \u201ccompressed\u201d in the sense of [8], speeding up convergence.\nWe demonstrate the practical benefits of our method with several experiments: we compare PSTD to competing algorithms on a synthetic example and a difficult optimal stopping problem. In the latter problem, a significant amount of prior work has gone into hand-tuning features. We show that, if we add a large number of weakly relevant features to these hand-tuned features, PSTD can find a predictive subspace which performs much better than competing approaches, improving on the best previously reported result for this problem by a substantial margin. The theoretical and empirical results reported here suggest that, for many applications where LSTD is used to compute a value function, PSTD can simply be substituted to produce better results.\n2 Value Function Approximation\nWe start from a discrete time dynamical system with a set of states S, a set of actions A, a distribution over initial states \u03c0_0, a transition function T, a reward function R, and a discount factor \u03b3 \u2208 [0, 1]. We seek a policy \u03c0, a mapping from states to actions. For a given policy \u03c0, the value of state s is defined as the expected discounted sum of rewards when starting in state s and following policy \u03c0: J^\u03c0(s) = E[\u2211_{t=0}^\u221e \u03b3^t R(s_t) | s_0 = s, \u03c0]. The value function obeys the Bellman equation\nJ^\u03c0(s) = R(s) + \u03b3 \u2211_{s'} J^\u03c0(s') Pr[s' | s, \u03c0(s)]   (1)\nIf we know the transition function T, and if the set of states S is sufficiently small, we can find an optimal policy with policy iteration: pick an initial policy \u03c0, use (1) to solve for the value function J^\u03c0, compute the greedy policy for J^\u03c0 (setting the action at each state to maximize the right-hand side of (1)), and repeat. However, we consider instead the harder problem of estimating the value function when s is a partially observable latent variable, and when the transition function T is unknown. In this situation, we receive information about s through observations from a finite set O. We can no longer make decisions or predict reward based on S, but instead must use a history (an ordered sequence of action-observation pairs h = a^h_1 o^h_1 . . . a^h_t o^h_t that have been executed and observed prior to time t): R(h), J(h), and \u03c0(h) instead of R(s), J^\u03c0(s), and \u03c0(s). Let H be the set of all possible histories. H is often very large or infinite, so instead of finding a value separately for each history, we focus on value functions that are linear in features of histories:\nJ^\u03c0(h) = w^T \u03c6^H(h)   (2)\nHere w \u2208 R^j is a parameter vector and \u03c6^H(h) \u2208 R^j is a feature vector for a history h. So, we can rewrite the Bellman equation as\nw^T \u03c6^H(h) = R(h) + \u03b3 \u2211_{o \u2208 O} w^T \u03c6^H(h\u03c0o) Pr[h\u03c0o | h\u03c0]   (3)\nwhere h\u03c0o is history h extended by taking action \u03c0(h) and observing o.\n2.1 Least Squares Temporal Difference Learning\nIn general we don't know the transition probabilities Pr[h\u03c0o | h], but we do have samples of state features \u03c6^H_t = \u03c6^H(h_t), next-state features \u03c6^H_{t+1} = \u03c6^H(h_{t+1}), and immediate rewards R_t = R(h_t). We can thus estimate the Bellman equation\nw^T \u03c6^H_{1:k} \u2248 R_{1:k} + \u03b3 w^T \u03c6^H_{2:k+1}   (4)\n(Here we have used \u03c6^H_{1:k} to mean the matrix whose columns are \u03c6^H_t for t = 1 . . . k.) We can immediately attempt to estimate the parameter w by solving this linear system in the least squares sense: \u02c6w^T = R_{1:k} (\u03c6^H_{1:k} \u2212 \u03b3 \u03c6^H_{2:k+1})^\u2020, where \u2020 indicates the pseudo-inverse. However, this solution is biased [3], since the independent variables \u03c6^H_t \u2212 \u03b3 \u03c6^H_{t+1} are noisy samples of the expected difference E[\u03c6^H(h) \u2212 \u03b3 \u2211_{o \u2208 O} \u03c6^H(h\u03c0o) Pr[h\u03c0o | h]]. In other words, estimating the value function parameters w is an error-in-variables problem.\nThe least squares temporal difference (LSTD) algorithm finds a consistent estimate of w by right-multiplying the approximate Bellman equation (Equation 4) by (\u03c6^H_t)^T:\n\u02c6w^T = (1/k \u2211_{t=1}^k R_t (\u03c6^H_t)^T)(1/k \u2211_{t=1}^k \u03c6^H_t (\u03c6^H_t)^T \u2212 \u03b3/k \u2211_{t=1}^k \u03c6^H_{t+1} (\u03c6^H_t)^T)^{\u22121}   (5)\nHere, (\u03c6^H_t)^T can be viewed as an instrumental variable [3], i.e., a measurement that is correlated with the true independent variables but uncorrelated with the noise in our estimates of these variables. As the amount of data k increases, the empirical covariance matrices \u03c6^H_{1:k} (\u03c6^H_{1:k})^T / k and \u03c6^H_{2:k+1} (\u03c6^H_{1:k})^T / k converge with probability 1 to their population values, and so our estimate of the matrix to be inverted in (5) is consistent. So, as long as this matrix is nonsingular, our estimate of the inverse is also consistent, and our estimate of w converges to the true value with probability 1.\n3 Predictive Features\nLSTD provides a consistent estimate of the value function parameters w; but in practice, if the number of features is large relative to the number of training samples, then the LSTD estimate of w is prone to overfitting. This problem can be alleviated by choosing a small set of features that only contains information that is relevant for value function approximation. However, with the exception of LARS-TD [9], there has been little work on how to select features automatically for value function approximation when the system model is unknown; and of course, manual feature selection depends on not-always-available expert guidance. We approach the problem of finding a good set of features from a bottleneck perspective. 
That is, given a large set of features of history, we would like to \ufb01nd\na compression that preserves only relevant information for predicting the value function J \u03c0. As we\nwill see in Section 4, this improvement is directly related to spectral identi\ufb01cation of PSRs.\n\n3.1 Finding Predictive Features Through a Bottleneck\n\nIn order to \ufb01nd a predictive feature compression, we \ufb01rst need to determine what we would like to\npredict. The most relevant prediction is the value function itself; so, we could simply try to predict\ntotal future discounted reward. Unfortunately, total discounted reward has high variance, so unless\nwe have a lot of data, learning will be dif\ufb01cult. We can reduce variance by including other prediction\ntasks as well. For example, predicting individual rewards at future time steps seems highly relevant,\nand gives us much more immediate feedback. Similarly, future observations hopefully contain\ninformation about future reward, so trying to predict observations can help us predict reward.\nWe call these prediction tasks, collectively, features of the future. We write \u03c6T\nfor the vector of\nall features of the \u201cfuture at time t,\u201d i.e., events starting at time t + 1 and continuing forward.\nNow, instead of remembering a large arbitrary set of features of history, we want to \ufb01nd a small\nsubspace of features of history that is relevant for predicting features of the future. We will call this\nsubspace a predictive compression, and we will write the value function as a linear function of only\nthe predictive compression of features. To \ufb01nd our predictive compression, we will use reduced-\nrank regression [10]. 
We define the following empirical covariance matrices between features of the future and features of histories:\n\u02c6\u03a3_{T,H} = (1/k) \u2211_{t=1}^k \u03c6^T_t (\u03c6^H_t)^T, \u02c6\u03a3_{H,H} = (1/k) \u2211_{t=1}^k \u03c6^H_t (\u03c6^H_t)^T   (6)\nLet L_H be the lower triangular Cholesky factor of \u02c6\u03a3_{H,H}. Then we can find a predictive compression of histories by a singular value decomposition (SVD) of the weighted covariance: write U D V^T \u2248 \u02c6\u03a3_{T,H} L_H^{\u2212T} for a truncated SVD [11], where U contains the left singular vectors, V contains the right singular vectors, and D is the diagonal matrix of singular values. (We can tune accuracy by keeping more or fewer singular values, i.e., columns of U, V, or D.) We use the SVD to define a mapping \u02c6U from the compressed space up to the space of features of the future, and we define \u02c6V to be the optimal compression operator given \u02c6U (in a least-squares sense, see [12] for details):\n\u02c6U = U D^{1/2}, \u02c6V = \u02c6U^T \u02c6\u03a3_{T,H} (\u02c6\u03a3_{H,H})^{\u22121}   (7)\nBy weighting different features of the future differently, we can change the approximate compression in interesting ways. For example, as we will see in Section 4.2, scaling up future reward by a constant factor results in a value-directed compression\u2014but, unlike previous ways to find value-directed compressions [8], we do not need to know a model of our system ahead of time. For another example, let L_T be the Cholesky factor of the empirical covariance of future features \u02c6\u03a3_{T,T}. Then, if we scale features of the future by L_T^{\u2212T}, the SVD will preserve the largest possible amount of mutual information between history and future, yielding a canonical correlation analysis [13, 14].\n3.2 Predictive State Temporal Difference Learning\nNow that we have found a predictive compression operator \u02c6V via Equation 7, we can replace the features of history \u03c6^H_t with the compressed features \u02c6V \u03c6^H_t in the Bellman recursion, Equation 4:\nw^T \u02c6V \u03c6^H_{1:k} \u2248 R_{1:k} + \u03b3 w^T \u02c6V \u03c6^H_{2:k+1}   (8)\nThe least squares solution for w is still prone to an error-in-variables problem. The instrumental variable \u03c6^H is still correlated with the true independent variables and uncorrelated with noise, and so we can again use it to unbias the estimate of w. Define the additional covariance matrices:\n\u02c6\u03a3_{H+,H} = (1/k) \u2211_{t=1}^k \u03c6^H_{t+1} (\u03c6^H_t)^T, \u02c6\u03a3_{R,H} = (1/k) \u2211_{t=1}^k R_t (\u03c6^H_t)^T   (9)\nThen, the corrected Bellman equation is w^T \u02c6V \u02c6\u03a3_{H,H} = \u02c6\u03a3_{R,H} + \u03b3 w^T \u02c6V \u02c6\u03a3_{H+,H}, and solving for w gives us the Predictive State Temporal Difference (PSTD) learning algorithm:\nw^T = \u02c6\u03a3_{R,H} (\u02c6V \u02c6\u03a3_{H,H} \u2212 \u03b3 \u02c6V \u02c6\u03a3_{H+,H})^\u2020   (10)\nSo far we have provided some intuition for why predictive features should be better than arbitrary features for temporal difference learning. Below we will show an additional benefit: the model-free algorithm in Equation 10 is, under some circumstances, equivalent to a model-based method which uses subspace identification to learn Predictive State Representations [6, 7].\n4 Predictive State Representations\nA predictive state representation (PSR) [15] is a compact and complete description of a dynamical system. 
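Pulling Sections 3.1 and 3.2 together, Equations 6\u201310 can be sketched compactly in NumPy. This is our own illustrative sketch, not the authors' code: the names are hypothetical, and we add a tiny ridge term so the Cholesky factorization of the empirical covariance stays well defined:

```python
import numpy as np

def pstd(PhiH, PhiT, R, gamma, dim, ridge=1e-8):
    """Sketch of PSTD (cf. Equations 6-10).

    PhiH : j x (k+1) history features (columns phi^H_t)
    PhiT : l x k future features (columns phi^T_t)
    R    : length-k rewards; dim : size of the predictive compression
    """
    H, Hnext = PhiH[:, :-1], PhiH[:, 1:]
    k = R.shape[0]
    S_TH = PhiT @ H.T / k                            # Sigma_{T,H}   (Eq. 6)
    S_HH = H @ H.T / k + ridge * np.eye(H.shape[0])  # Sigma_{H,H}   (Eq. 6)
    S_HpH = Hnext @ H.T / k                          # Sigma_{H+,H}  (Eq. 9)
    S_RH = R @ H.T / k                               # Sigma_{R,H}   (Eq. 9)
    LH = np.linalg.cholesky(S_HH)                    # lower-triangular factor
    U, D, _ = np.linalg.svd(S_TH @ np.linalg.inv(LH.T), full_matrices=False)
    Uhat = U[:, :dim] * np.sqrt(D[:dim])             # Uhat = U D^{1/2}     (Eq. 7)
    Vhat = Uhat.T @ S_TH @ np.linalg.inv(S_HH)       # compression operator (Eq. 7)
    w = S_RH @ np.linalg.pinv(Vhat @ S_HH - gamma * Vhat @ S_HpH)  # Eq. 10
    return w, Vhat  # value of history h is w @ (Vhat @ phi^H(h))
```

The compressed value estimate reduces to ordinary LSTD in the compressed feature space, which is what Equation 10 expresses.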
Unlike POMDPs, which represent state as a distribution over a latent variable, PSRs represent state as a set of predictions of tests. Just as a history is an ordered sequence of action-observation pairs executed prior to time t, we define a test of length i to be an ordered sequence of action-observation pairs \u03c4 = a_1 o_1 . . . a_i o_i that can be executed and observed after time t [15]. The prediction for a test \u03c4 after a history h, written \u03c4(h), is the probability that we will see the test observations \u03c4^O = o_1 . . . o_i, given that we intervene [16] to execute the test actions \u03c4^A = a_1 . . . a_i: \u03c4(h) = Pr[\u03c4^O | h, do(\u03c4^A)]. If Q = {\u03c4_1, . . . , \u03c4_n} is a set of tests, we write Q(h) = (\u03c4_1(h), . . . , \u03c4_n(h))^T for the corresponding vector of test predictions.\nFormally, a PSR consists of five elements \u27e8A, O, Q, m_1, F\u27e9. A is a finite set of possible actions, and O is a finite set of possible observations. Q is a core set of tests, i.e., a set whose vector of predictions Q(h) is a sufficient statistic for predicting the success probabilities of all tests. F is the set of functions f_\u03c4 which embody these predictions: \u03c4(h) = f_\u03c4(Q(h)). And, m_1 = Q(\u03b5) is the initial prediction vector, where \u03b5 is the empty history. In this work we will restrict ourselves to linear PSRs, in which all prediction functions are linear: f_\u03c4(Q(h)) = r_\u03c4^T Q(h) for some vector r_\u03c4 \u2208 R^{|Q|}. Finally, a core set Q is minimal if the tests in Q are linearly independent [17, 18], i.e., no one test's prediction is a linear function of the other tests' predictions.\nSince Q(h) is a sufficient statistic for all tests, it is a state for our PSR: i.e., we can remember just Q(h) instead of h itself. 
After action a and observation o, we can update Q(h) recursively: if we write M_{ao} for the matrix with rows r_{ao\u03c4}^T for \u03c4 \u2208 Q, then we can use Bayes' Rule to show:\nQ(hao) = M_{ao} Q(h) / Pr[o | h, do(a)] = M_{ao} Q(h) / (m_\u221e^T M_{ao} Q(h))   (11)\nwhere m_\u221e is a normalizer, defined by m_\u221e^T Q(h) = 1 for all h. In addition to the above PSR parameters, for reinforcement learning we need a reward function R(h) = \u03b7^T Q(h) mapping predictive states to immediate rewards, a discount factor \u03b3 \u2208 [0, 1] which weights the importance of future rewards vs. present ones, and a policy \u03c0(Q(h)) mapping from predictive states to actions.\nInstead of ordinary PSRs, we will work with transformed PSRs (TPSRs) [6, 7]. TPSRs are a generalization of regular PSRs: a TPSR maintains a small number of sufficient statistics which are linear combinations of a (potentially very large) set of test probabilities. That is, a TPSR maintains a small number of feature predictions instead of test predictions. TPSRs have exactly the same predictive abilities as regular PSRs, but are invariant under similarity transforms: given an invertible matrix S, we can transform m_1 \u2192 S m_1, m_\u221e^T \u2192 m_\u221e^T S^{\u22121}, and M_{ao} \u2192 S M_{ao} S^{\u22121} without changing the corresponding dynamical system, since pairs S^{\u22121}S cancel in Eq. 11. The main benefit of TPSRs over regular PSRs is that, given any core set of tests, low dimensional parameters can be found using spectral matrix decomposition and regression instead of combinatorial search. In this respect, TPSRs are closely related to the transformed representations of LDSs and HMMs found by subspace identification [19, 20, 14, 21].\n4.1 Learning Transformed PSRs\nLet Q be a minimal core set of tests, so that n = |Q| is the linear dimension of the system. 
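The recursive update of Equation 11 is straightforward to implement once M_{ao} and m_\u221e are in hand. The following sketch (our own illustrative code; the names are hypothetical) filters a predictive state through one action-observation pair:

```python
import numpy as np

def psr_filter(state, M_ao, m_inf):
    """One step of the PSR state update (cf. Equation 11).

    state : current predictive state Q(h)
    M_ao  : update matrix for the executed action a and seen observation o
    m_inf : normalizer vector, with m_inf . Q(h) = 1 for every history h
    Returns the new state Q(hao) and Pr[o | h, do(a)].
    """
    unnormalized = M_ao @ state
    p_obs = m_inf @ unnormalized   # Pr[o | h, do(a)] = m_inf^T M_ao Q(h)
    return unnormalized / p_obs, p_obs
```

Because the same M_ao appears in the numerator and the normalizer, the update is invariant to the similarity transforms discussed for TPSRs.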
Then, let T be a larger core set of tests (not necessarily minimal), and let H be the set of all possible histories. As before, write \u03c6^H_t \u2208 R^j for a vector of features of history at time t, and write \u03c6^T_t \u2208 R^\u2113 for a vector of features of the future at time t. Since T is a core set of tests, by definition we can compute any test prediction \u03c4(h) as a linear function of T(h). And, since feature predictions are linear combinations of test predictions, we can also compute any feature prediction \u03c6(h) as a linear function of T(h). We define the matrix \u03a6^T \u2208 R^{\u2113\u00d7|T|} to embody our predictions of future features: an entry of \u03a6^T is the weight of one of the tests in T for calculating the prediction of one of the features in \u03c6^T. Below we define several covariance matrices, Equation 12(a\u2013d), in terms of the observable quantities \u03c6^T_t, \u03c6^H_t, a_t, and o_t, and show how these matrices relate to the parameters of the underlying PSR. These relationships then lead to our learning algorithm, Eq. 14 below.\nFirst we define \u03a3_{H,H}, the covariance matrix of features of histories, as E[\u03c6^H_t (\u03c6^H_t)^T | h_t \u223c \u03c9]. Given k samples, we can approximate this covariance:\n\u02c6\u03a3_{H,H} \u2261 (1/k) \u03c6^H_{1:k} (\u03c6^H_{1:k})^T   (12a)\nAs k \u2192 \u221e, \u02c6\u03a3_{H,H} converges to the true covariance \u03a3_{H,H} with probability 1. Next we define \u03a3_{S,H}, the cross covariance of states and features of histories. Writing s_t = Q(h_t) for the (unobserved) state at time t, let \u03a3_{S,H} = E[(1/k) s_{1:k} (\u03c6^H_{1:k})^T | h_t \u223c \u03c9 (\u2200t)]. We cannot directly estimate \u03a3_{S,H} from data, but this matrix will appear as a factor in several of the matrices that we define below. Next we define \u03a3_{T,H}, the cross covariance matrix of the features of tests and histories (see [12] for derivations):\n\u02c6\u03a3_{T,H} \u2261 (1/k) \u03c6^T_{1:k} (\u03c6^H_{1:k})^T, \u03a3_{T,H} \u2261 E[\u03c6^T_t (\u03c6^H_t)^T | h_t \u223c \u03c9, do(\u03b6)] = \u03a6^T R \u03a3_{S,H}   (12b)\nwhere row \u03c4 of the matrix R is r_\u03c4, the linear function that specifies the prediction of the test \u03c4 given the predictions of tests in the core set Q. By do(\u03b6), we mean to approximate the effect of executing all sequences of actions required by all tests or features of the future at once. This is not difficult in our experiments (in which all tests use compatible action sequences); but see [12] for further discussion. Eq. 12b tells us that, because of our assumptions about linear dimension, the matrix \u03a3_{T,H} has factors R \u2208 R^{|T|\u00d7n} and \u03a3_{S,H} \u2208 R^{n\u00d7j}. Therefore, the rank of \u03a3_{T,H} is no more than n, the linear dimension of the system. We can also see that, since the size of \u03a3_{T,H} is fixed, as the number of samples k increases, \u02c6\u03a3_{T,H} \u2192 \u03a3_{T,H} with probability 1.\nNext we define \u03a3_{H,ao,H}, a set of matrices, one for each action-observation pair, that represent the covariance between features of history before and after taking action a and observing o. In the following, I_t(o) is an indicator variable for whether we see observation o at step t.\n\u02c6\u03a3_{H,ao,H} \u2261 (1/k) \u2211_{t=1}^k I_t(o) \u03c6^H_{t+1} (\u03c6^H_t)^T, \u03a3_{H,ao,H} \u2261 E[\u02c6\u03a3_{H,ao,H} | h_t \u223c \u03c9 (\u2200t), do(a) (\u2200t)]   (12c)\nSince the dimensions of each \u02c6\u03a3_{H,ao,H} are fixed, as k \u2192 \u221e these empirical covariances converge to the true covariances \u03a3_{H,ao,H} with probability 1. Finally we define \u03a3_{R,H}, and approximate the covariance (in this case a vector) of reward and features of history:\n\u02c6\u03a3_{R,H} \u2261 (1/k) \u2211_{t=1}^k R_t (\u03c6^H_t)^T, \u03a3_{R,H} \u2261 E[R_t (\u03c6^H_t)^T | h_t \u223c \u03c9] = \u03b7^T \u03a3_{S,H}   (12d)\nAgain, as k \u2192 \u221e, \u02c6\u03a3_{R,H} converges to \u03a3_{R,H} with probability 1.\nWe now wish to use the above-defined matrices to learn a TPSR from data. To do so we need to make a somewhat-restrictive assumption: we assume that our features of history are rich enough to determine the state of the system, i.e., the regression from \u03c6^H to s is exact: s_t = \u03a3_{S,H} \u03a3_{H,H}^{\u22121} \u03c6^H_t. We discuss how to relax this assumption in [12]. We also need a matrix U such that U^T \u03a6^T R is invertible; with probability 1 a random matrix satisfies this condition, but as we will see below, there are reasons to choose U via SVD of a scaled version of \u03a3_{T,H} as described in Sec. 3.1. Using our assumptions we can show a useful identity for \u03a3_{H,ao,H} (for proof details see [12]):\n\u03a3_{S,H} \u03a3_{H,H}^{\u22121} \u03a3_{H,ao,H} = M_{ao} \u03a3_{S,H}   (13)\nThis identity is at the heart of our learning algorithm: it states that \u03a3_{H,ao,H} contains a hidden copy of M_{ao}, the main TPSR parameter that we need to learn. We would like to recover M_{ao} via Eq. 13, M_{ao} = \u03a3_{S,H} \u03a3_{H,H}^{\u22121} \u03a3_{H,ao,H} \u03a3_{S,H}^\u2020; but of course we do not know \u03a3_{S,H}. Fortunately, it turns out that we can use U^T \u03a3_{T,H} as a stand-in, since this matrix differs only by an invertible transform (Eq. 
12b).\nWe now show how to recover a TPSR from the matrices \u03a3_{T,H}, \u03a3_{H,H}, \u03a3_{R,H}, \u03a3_{H,ao,H}, and U. Since a TPSR's predictions are invariant to a similarity transform of its parameters, our algorithm only recovers the TPSR parameters to within a similarity transform [7, 12].\nb_t \u2261 U^T \u03a3_{T,H} (\u03a3_{H,H})^{\u22121} \u03c6^H_t = (U^T \u03a6^T R) s_t   (14a)\nB_{ao} \u2261 U^T \u03a3_{T,H} (\u03a3_{H,H})^{\u22121} \u03a3_{H,ao,H} (U^T \u03a3_{T,H})^\u2020 = (U^T \u03a6^T R) M_{ao} (U^T \u03a6^T R)^{\u22121}   (14b)\nb_\u03b7^T \u2261 \u03a3_{R,H} (U^T \u03a3_{T,H})^\u2020 = \u03b7^T (U^T \u03a6^T R)^{\u22121}   (14c)\nOur PSR learning algorithm is simple: replace each true covariance matrix in Eq. 14 by its empirical estimate. Since the empirical estimates converge to their true values with probability 1 as the sample size increases, our learning algorithm is clearly statistically consistent.\n4.2 Predictive State Temporal Difference Learning (Revisited)\nFinally, we are ready to show that the model-free PSTD learning algorithm introduced in Section 3.2 is equivalent to a model-based algorithm built around PSR learning. For a fixed policy \u03c0, a PSR or TPSR's value function is a linear function of state, V(s) = w^T s, and is the solution of the PSR Bellman equation [22]: for all s, w^T s = b_\u03b7^T s + \u03b3 \u2211_{o \u2208 O} w^T B_{\u03c0o} s, or equivalently, w^T = b_\u03b7^T + \u03b3 \u2211_{o \u2208 O} w^T B_{\u03c0o}. Substituting in our learned PSR parameters from Equations 14(a\u2013c), we get\nw^T = \u02c6\u03a3_{R,H} (U^T \u02c6\u03a3_{T,H})^\u2020 + \u03b3 \u2211_{o \u2208 O} w^T U^T \u02c6\u03a3_{T,H} (\u02c6\u03a3_{H,H})^{\u22121} \u02c6\u03a3_{H,\u03c0o,H} (U^T \u02c6\u03a3_{T,H})^\u2020\nand right-multiplying both sides by U^T \u02c6\u03a3_{T,H} gives\nw^T U^T \u02c6\u03a3_{T,H} = \u02c6\u03a3_{R,H} + \u03b3 w^T U^T \u02c6\u03a3_{T,H} (\u02c6\u03a3_{H,H})^{\u22121} \u02c6\u03a3_{H+,H}\nsince, by comparing Eqs. 12c and 9, we can see that \u2211_{o \u2208 O} \u02c6\u03a3_{H,\u03c0o,H} = \u02c6\u03a3_{H+,H}. Now, define \u02c6U and \u02c6V as in Eq. 7, and let U = \u02c6U as suggested above in Sec. 4.1. Then U^T \u02c6\u03a3_{T,H} = \u02c6V \u02c6\u03a3_{H,H}, and\nw^T \u02c6V \u02c6\u03a3_{H,H} = \u02c6\u03a3_{R,H} + \u03b3 w^T \u02c6V \u02c6\u03a3_{H+,H} \u21d2 w^T = \u02c6\u03a3_{R,H} (\u02c6V \u02c6\u03a3_{H,H} \u2212 \u03b3 \u02c6V \u02c6\u03a3_{H+,H})^\u2020   (15)\nEq. 15 is exactly Eq. 10, the PSTD algorithm. So, we have shown that, if we learn a PSR by the subspace identification algorithm of Sec. 4.1 and then compute its value function via the Bellman equation, we get the exact same answer as if we had directly learned the value function via the model-free PSTD method. In addition to adding to our understanding of both methods, an important corollary of this result is that PSTD is a statistically consistent algorithm for PSR value function approximation\u2014to our knowledge, the first such result for a TD method.\n5 Experimental Results\n5.1 Estimating the Value Function of a RR-POMDP\nWe evaluate the PSTD learning algorithm on a synthetic example derived from [23]. The problem is to find the value function of a policy in a partially observable Markov decision process (POMDP). The POMDP has 4 latent states, but the policy's transition matrix is low rank: the resulting belief distributions lie in a 3-dimensional subspace of the original belief simplex (see [12] for details).\nFigure 1: Experimental Results. Error bars indicate standard error. (A) Estimating the value function with a small number of informative features. All three approaches do well. (B) Estimating the value function with a small set of informative features and a large set of random features. LARS-TD is designed for this scenario and dramatically outperforms PSTD and LSTD. (C) Estimating the value function with a large set of semi-informative features. 
PSTD is able to determine a small\nset of compressed features that retain the maximal amount of information about the value function,\noutperforming LSTD and LARS-TD. (D) Pricing a high-dimensional derivative via policy iteration.\nThe optimal threshold strategy (sell if price is above a threshold [24]) is in black, LSTD (16 canon-\nical features) is in blue, LSTD (on the full 220 features) is cyan, LARS-TD (feature selection from\nset of 220) is in green, and PSTD (16 dimensions, compressing 220 features) is in red.\n\nWe perform 3 experiments, comparing the performance of LSTD, LARS-TD, and PSTD when dif-\nferent sets of features are used.\nIn each case we compare the value function estimated by each\nalgorithm to the true value function computed by J \u03c0 = R(I \u2212 \u03b3T \u03c0)\u22121. In the \ufb01rst experiment\nwe execute the policy \u03c0 for 1000 time steps. We split the data into overlapping histories and tests\nof length 5, and sample 10 of these histories and tests to serve as centers for Gaussian radial ba-\nsis functions. We then evaluate each basis function at every remaining sample. Then, using these\nfeatures, we learned the value function using LSTD, LARS-TD, and PSTD with linear dimension\n3 (Figure 1(A)). Each method estimated a reasonable value function. For the second experiment,\nwe added 490 random, uninformative features to the 10 good features and then attempted to learn\nthe value function with each of the 3 algorithms (Figure 1(B)). In this case, LSTD and PSTD both\nhad dif\ufb01culty \ufb01tting the value function due to the large number of irrelevant features. LARS-TD,\ndesigned for precisely this scenario, was able to select the 10 relevant features and estimate the\nvalue function better by a substantial margin. For the third experiment, we increased the number of\nsampled features from 10 to 500. In this case, each feature was somewhat relevant, but the number\nof features was large compared to the amount of training data. 
This situation occurs frequently in practice: it is often easy to find a large number of features that are at least somewhat related to state. PSTD outperforms LSTD and LARS-TD by summarizing these features and efficiently estimating the value function (Figure 1(C)).

5.2 Pricing a High-Dimensional Financial Derivative

Derivatives are financial contracts with payoffs linked to the future prices of basic assets such as stocks, bonds, and commodities. In some derivatives the contract holder has no choices, but in more complex cases the holder must make decisions, and the value of the contract depends on how the holder acts. For example, with early exercise the holder can decide to terminate the contract at any time and receive payments based on prevailing market conditions, so deciding when to exercise is an optimal stopping problem. Stopping problems provide an ideal testbed for policy evaluation methods, since we can collect a single data set which lets us evaluate any policy: we just choose the "continue" action forever. (We can then evaluate the "stop" action easily in any of the resulting states.)

We consider the financial derivative introduced by Tsitsiklis and Van Roy [24]. The derivative generates payoffs that are contingent on the prices of a single stock. At the end of each day, the holder may opt to exercise. At exercise the holder receives a payoff equal to the current price of the stock divided by the price 100 days beforehand. We can think of this derivative as a "psychic call": the holder gets to decide whether s/he would like to have bought an ordinary 100-day European call option, at the then-current market price, 100 days ago. In our simulation (and unknown to the investor), the underlying stock price follows a geometric Brownian motion with volatility σ = 0.02 and continuously compounded short-term growth rate ρ = 0.0004.
Assuming stock prices fluctuate only on days when the market is open, these parameters correspond to an annual growth rate of ∼10%. In more detail, if $w_t$ is a standard Brownian motion, then the stock price $p_t$ evolves as $dp_t = \rho p_t\, dt + \sigma p_t\, dw_t$, and we can summarize the relevant state at the end of each day as a vector $x_t \in \mathbb{R}^{100}$, with

$$x_t = \left( \frac{p_{t-99}}{p_{t-100}}, \frac{p_{t-98}}{p_{t-100}}, \ldots, \frac{p_t}{p_{t-100}} \right)^T.$$

This process is Markov and ergodic [24, 25]: $x_t$ and $x_{t+100}$ are independent and identically distributed. The immediate reward for exercising the option is $G(x) = x(100)$, and the immediate reward for continuing to hold the option is 0. The discount factor $\gamma = e^{-\rho}$ is determined by the growth rate; this corresponds to assuming that the risk-free interest rate is equal to the stock's growth rate, meaning that the investor gains nothing in expectation by holding the stock itself.

The value of the derivative, if the current state is $x$, is given by $V^*(x) = \sup_t E[\gamma^t G(x_t) \mid x_0 = x]$. Our goal is to calculate an approximate value function $V(x) = w^T \phi^H(x)$, and then use this value function to generate a stopping time $\min\{t \mid G(x_t) \geq V(x_t)\}$. To do so, we sample a sequence of 1,000,000 states $x_t \in \mathbb{R}^{100}$ and calculate features $\phi^H$ of each state.
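The simulation setup above is easy to reproduce; the following Python sketch uses the stated parameters (σ = 0.02, ρ = 0.0004), though the one-day exact discretization of the geometric Brownian motion, the seed, and the trajectory length are our own assumptions, not details from the paper:

```python
import numpy as np

rho, sigma = 0.0004, 0.02        # daily growth rate and volatility from the text
gamma = np.exp(-rho)             # discount factor determined by the growth rate
rng = np.random.default_rng(0)

# Exact one-day discretization of dp_t = rho*p_t dt + sigma*p_t dw_t:
# p_{t+1} = p_t * exp((rho - sigma^2/2) + sigma * eps_t), eps_t ~ N(0, 1).
T = 1000
p = np.empty(T)
p[0] = 1.0
for t in range(T - 1):
    p[t + 1] = p[t] * np.exp(rho - 0.5 * sigma**2 + sigma * rng.standard_normal())

def state(t):
    """x_t = (p_{t-99}/p_{t-100}, ..., p_t/p_{t-100})^T, valid for t >= 100."""
    return p[t - 99 : t + 1] / p[t - 100]

def G(x):
    """Immediate reward for exercising: last component, p_t / p_{t-100}."""
    return x[-1]

x = state(500)
assert x.shape == (100,)
assert np.isclose(G(x), p[500] / p[400])
```

Because $x_t$ and $x_{t+100}$ are i.i.d., a single long price path like this one yields many effectively independent states for policy evaluation.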
We then perform policy iteration on this sample, alternately estimating the value function under a given policy and then using this value function to define a new greedy policy "stop if $G(x_t) \geq w^T \phi^H(x_t)$."

Within the above strategy, we have two main choices: which features to use, and how to estimate the value function in terms of these features. For value function estimation, we used LSTD, LARS-TD, or PSTD. In each case we re-used our 1,000,000-state sample trajectory for all iterations: we start at the beginning and follow the trajectory as long as the policy chooses the "continue" action, with reward 0 at each step. When the policy executes the "stop" action, the reward is $G(x)$ and the next state's features are all 0; we then restart the policy 100 steps in the future, after the process has fully mixed. For feature selection, we are fortunate: previous researchers have hand-selected a "good" set of 16 features for this data set through repeated trial and error (see [12] and [24, 25]). We greatly expand this set of features, then use PSTD to synthesize a small set of high-quality combined features. Specifically, we add the entire 100-step state vector, the squares of the components of the state vector, and several additional nonlinear features, increasing the total number of features from 16 to 220. We use histories of length 1, tests of length 5, and (for comparison's sake) we choose a linear dimension of 16. Tests (but not histories) were value-directed by reducing the variance of all features except reward by a factor of 100.

Figure 1D shows results. We compared PSTD (reducing 220 features to 16) to LSTD with either the 16 hand-selected features or the full 220 features, as well as to LARS-TD (220 features) and to a simple thresholding strategy [24]. In each case we evaluated the final policy on 10,000 new random trajectories.
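The evaluation-then-improvement loop described above alternates a least-squares TD solve with a greedy stopping rule. The sketch below shows generic LSTD and the greedy update on a toy two-state chain where LSTD should recover the exact values; the chain, one-hot features, and ridge term are illustrative stand-ins for the 220-feature setup (and PSTD's compression step is omitted):

```python
import numpy as np

def lstd(Phi, Phi_next, rewards, gamma, ridge=1e-6):
    """One LSTD solve: w such that Phi @ w approximates the Bellman fixed
    point. Phi[i] holds features of x_i and Phi_next[i] features of x_{i+1}
    (all zeros after a 'stop', as in the text); ridge regularizes A."""
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ rewards
    return np.linalg.solve(A + ridge * np.eye(A.shape[1]), b)

def greedy_stop(G_x, phi_x, w):
    """Greedy improvement: stop if the immediate payoff beats the
    estimated continuation value w^T phi(x)."""
    return G_x >= w @ phi_x

# Toy two-state chain with tabular (one-hot) features, so LSTD is exact
# up to sampling noise in the empirical transitions.
gamma = 0.9
P = np.array([[0.5, 0.5], [0.2, 0.8]])
r = np.array([0.0, 1.0])
rng = np.random.default_rng(1)
states = [0]
for _ in range(20000):
    states.append(rng.choice(2, p=P[states[-1]]))
s = np.array(states)
Phi, Phi_next = np.eye(2)[s[:-1]], np.eye(2)[s[1:]]
w = lstd(Phi, Phi_next, r[s[:-1]], gamma)

v_true = np.linalg.solve(np.eye(2) - gamma * P, r)
print(w, v_true)   # w should be close to v_true
```

In the actual experiment the same sampled trajectory is re-used each iteration: the current `w` defines the stopping rule, which in turn determines which transitions (and which terminal rewards) feed the next LSTD solve.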
PSTD outperformed each of its competitors, improving on the next best approach, LARS-TD, by 1.75 percentage points. In fact, PSTD performs better than the best previously reported approach [24, 25] by 1.24 percentage points. These improvements correspond to appreciable fractions of the risk-free interest rate (which is about 4 percentage points over the 100-day window of the contract), and therefore to significant arbitrage opportunities: an investor who doesn't know the best strategy will consistently undervalue the security, allowing an informed investor to buy it for below its expected value.

6 Conclusion

In this paper, we attack the feature selection problem for temporal difference learning. Although well-known temporal difference algorithms such as LSTD can provide asymptotically unbiased estimates of value function parameters in linear architectures, they can have trouble in finite samples: if the number of features is large relative to the number of training samples, then they can have high variance in their value function estimates. For this reason, in real-world problems, a substantial amount of time is spent selecting a small set of features, often by trial and error [24, 25]. To remedy this problem, we present the PSTD algorithm, a new approach to feature selection for TD methods, which demonstrates how insights from system identification can benefit reinforcement learning. PSTD automatically chooses a small set of features that are relevant for prediction and value function approximation. It approaches feature selection from a bottleneck perspective, by finding a small set of features that preserves only predictive information.
Because of the focus on predictive information, the PSTD approach is closely connected to PSRs: under appropriate assumptions, PSTD's compressed set of features is asymptotically equivalent to TPSR state, and PSTD is a consistent estimator of the PSR value function.

We demonstrate the merits of PSTD compared to two popular alternative algorithms, LARS-TD and LSTD, on a synthetic example, and argue that PSTD is most effective when approximating a value function from a large number of features, each of which contains at least a little information about state. Finally, we apply PSTD to a difficult optimal stopping problem, and demonstrate the practical utility of the algorithm by outperforming several alternative approaches and topping the best previously reported results.

References

[1] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

[2] Justin A. Boyan. Least-squares temporal difference learning. In Proc. Intl. Conf. Machine Learning, pages 49–56. Morgan Kaufmann, San Francisco, CA, 1999.

[3] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

[4] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. J. Mach. Learn. Res., 4:1107–1149, 2003.

[5] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 752–759, New York, NY, USA, 2008. ACM.

[6] Matthew Rosencrantz, Geoffrey J. Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proc. ICML, 2004.

[7] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon.
Closing the learning-planning loop with predictive state representations. In Proceedings of Robotics: Science and Systems VI, 2010.

[8] Pascal Poupart and Craig Boutilier. Value-directed compression of POMDPs. In NIPS, pages 1547–1554, 2002.

[9] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528, New York, NY, USA, 2009. ACM.

[10] Gregory C. Reinsel and Rajabather Palani Velu. Multivariate Reduced-Rank Regression: Theory and Applications. Springer, 1998.

[11] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.

[12] Byron Boots and Geoffrey J. Gordon. Predictive state temporal difference learning. Technical report, arXiv.org.

[13] Harold Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139–142, 1935.

[14] S. Soatto and A. Chiuso. Dynamic data factorization. Technical report, UCLA, 2001.

[15] Michael Littman, Richard Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems (NIPS), 2002.

[16] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[17] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.

[18] Satinder Singh, Michael James, and Matthew Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proc. UAI, 2004.

[19] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, 1996.

[20] Tohru Katayama. Subspace Methods for System Identification. Springer-Verlag, 2005.

[21] Daniel Hsu, Sham Kakade, and Tong Zhang.
A spectral algorithm for learning hidden Markov models. In COLT, 2009.

[22] Michael R. James, Ton Wessling, and Nikos A. Vlassis. Improving approximate value iteration using memories and predictive state representations. In AAAI, 2006.

[23] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

[24] John N. Tsitsiklis and Benjamin Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44:1840–1851, 1997.

[25] David Choi and Benjamin Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16(2):207–239, 2006.