{"title": "Predictive Representations of State", "book": "Advances in Neural Information Processing Systems", "page_first": 1555, "page_last": 1561, "abstract": "", "full_text": "Predictive Representations of State\n\nMichael L. Littman\nRichard S. Sutton\n\nAT&T Labs-Research, Florham Park, New Jersey\n\n{mlittman,sutton}~research.att.com\n\nSatinder Singh\n\nSyntek Capital, New York, New York\n\nbaveja~cs.colorado.edu\n\nAbstract\n\nWe show that states of a dynamical system can be usefully repre(cid:173)\nsented by multi-step, action-conditional predictions of future ob(cid:173)\nservations. State representations that are grounded in data in this\nway may be easier to learn, generalize better, and be less depen(cid:173)\ndent on accurate prior models than, for example, POMDP state\nrepresentations. Building on prior work by Jaeger and by Rivest\nand Schapire, in this paper we compare and contrast a linear spe(cid:173)\ncialization of the predictive approach with the state representa(cid:173)\ntions used in POMDPs and in k-order Markov models. Ours is the\nfirst specific formulation of the predictive idea that includes both\nstochasticity and actions (controls). We show that any system has\na linear predictive state representation with number of predictions\nno greater than the number of states in its minimal POMDP model.\n\nIn predicting or controlling a sequence of observations, the concepts of state and\nstate estimation inevitably arise. There have been two dominant approaches. The\ngenerative-model approach, typified by research on partially observable Markov de(cid:173)\ncision processes (POMDPs), hypothesizes a structure for generating observations\nand estimates its state and state dynamics. The history-based approach, typified by\nk-order Markov methods, uses simple functions of past observations as state, that is,\nas the immediate basis for prediction and control. (The data flow in these two ap(cid:173)\nproaches are diagrammed in Figure 1.) 
Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memory (the ability to remember an event that happened arbitrarily long ago), whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997).\n\nFigure 1: Data flow in a) POMDP and other recursive updating of state representation, and b) history-based state representation.\n\nIn practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparatively simple.
McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones.\n\nThe predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure 1(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data. Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test a1o1a2o2, where a1 and a2 are specific actions and o1 and o2 are specific observations. The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order), i.e., Pr{O_k = o1, O_{k+1} = o2 | A_k = a1, A_{k+1} = a2}. Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcomes of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic).\n\nAs an example of these points, consider the float/reset problem (Figure 2), consisting of a linear string of 5 states with a distinguished reset state on the far right.
One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always 0 unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many f's there have been since the last r: for zero f's, it is 1; for one or two f's, it is 0.5; for three or four f's, it is 0.375; for five or six f's, it is 0.3125, and so on, decreasing after every second f and asymptotically bottoming out at 0.2.\n\nFigure 2: Underlying dynamics of the float/reset problem for a) the float action and b) the reset action. The numbers on the arcs indicate transition probabilities. The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1.\n\nNo k-order Markov method can model this system exactly, because no limited-length history is a sufficient statistic. A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: r1 and f0r1. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375, ...), which is always a sufficient statistic to predict the next pair in the sequence.
Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR.\n\nIn this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility that it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models is well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better.\n\nJaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze. We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure.\n\n1 Predictive State Representations\n\nWe consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment.
We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, P : {O|A}* → [0, 1], where P(o_1 ... o_t | a_1 ... a_t) is the probability of observations o_1, ..., o_t being generated, in that order, given that actions a_1, ..., a_t are taken, in that order. The probability of a test t conditional on a history h is defined as P(t|h) = P(ht)/P(h). Given a set of q tests Q = {t_i}, we define their (1 x q) prediction vector, p(h) = [P(t_1|h), P(t_2|h), ..., P(t_q|h)], as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, i.e., if and only if\n\nP(t|h) = f_t(p(h)),    (1)\n\nfor any test t and history h, and for some projection function f_t : [0, 1]^q → [0, 1]. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exists a (1 x q) projection vector m_t, for every test t, such that\n\nP(t|h) = f_t(p(h)) = p(h) m_t^T,    (2)\n\nfor all histories h.\n\nLet p_i(h) denote the ith component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair a,o, by\n\np_i(hao) = P(t_i|hao) = P(ot_i|ha) / P(o|ha) = f_{aot_i}(p(h)) / f_{ao}(p(h)) = p(h) m_{aot_i}^T / p(h) m_{ao}^T,    (3)\n\nwhere the last step is specific to linear PSRs.
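In code, update (3) is just two inner products and a division. Below is a minimal sketch (our illustration, not from the paper; the projection vectors are for a hypothetical one-test PSR) for an environment that is a biased coin: one action a (flip), observations heads/tails with P(heads) = 0.7 i.i.d., and the single test t1 = "flip, observe heads":

```python
import numpy as np

def psr_update(p, M_ao, m_ao):
    """One step of the linear PSR update, Equation (3).

    p    : (q,) prediction vector p(h)
    M_ao : (q, q) matrix whose ith row is the projection vector m_{aot_i}
    m_ao : (q,) projection vector for the one-step test ao
    """
    return (M_ao @ p) / (m_ao @ p)

# Hypothetical one-test PSR for the biased coin: p(h) = [0.7] for every h.
# The projection vectors are chosen so that p(h) m^T matches the required
# probabilities.  After heads: P(o t1 | ha) = 0.7 * 0.7 = 0.49 = 0.7 * 0.7,
# and P(o | ha) = 0.7 = 0.7 * 1.  After tails: P(o t1 | ha) = 0.3 * 0.7 =
# 0.7 * 0.3, and P(o | ha) = 0.3 = 0.7 * (3/7).
p = np.array([0.7])
M_heads, m_heads = np.array([[0.7]]), np.array([1.0])
M_tails, m_tails = np.array([[0.3]]), np.array([3.0 / 7.0])

print(psr_update(p, M_heads, m_heads))  # -> [0.7]
print(psr_update(p, M_tails, m_tails))  # -> [0.7]
```

Both updates return 0.7, as they must for an i.i.d. source: the prediction vector is the same after every history.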
We can now state our main result:\n\nTheorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model.\n\n2 Proof of Theorem 1: Constructing a PSR from a POMDP\n\nWe prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model.\n\nWe proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitutes a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP.\n\nA POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple (S, A, O, b_0, T, O). Here, S is a set of n underlying (hidden) states, A is a discrete set of actions, and O is a discrete set of observations. The (1 x n) vector b_0 is an initial state distribution. The set T consists of (n x n) transition matrices T^a, one for each action a, where T^a_{ij} is the probability of a transition from state i to j when action a is chosen. The set O consists of diagonal (n x n) observation matrices O^{a,o}, one for each pair of observation o and action a, where O^{a,o}_{ii} is the probability of observation o when action a is selected and state i is reached.[1]\n\nThe state representation in a POMDP (Figure 1(a)) is the belief state, the (1 x n) vector b(h) of state-occupation probabilities given the history h.
It can be computed recursively given a new action a and observation o by\n\nb(hao) = b(h) T^a O^{a,o} / b(h) T^a O^{a,o} e_n^T,\n\nwhere e_n is the (1 x n) vector of all 1s.\n\nFinally, a POMDP defines a probability distribution over tests (and thus histories) by\n\nP(o_1 ... o_t | h a_1 ... a_t) = b(h) T^{a_1} O^{a_1,o_1} ... T^{a_t} O^{a_t,o_t} e_n^T.    (4)\n\n[1] There are many equivalent formulations and the conversion procedure described here can be easily modified to accommodate other POMDP definitions.\n\nWe now present our algorithm for constructing a PSR for a given POMDP. It uses a function u mapping tests to (1 x n) vectors defined recursively by u(ε) = e_n and u(aot) = (T^a O^{a,o} u(t)^T)^T, where ε represents the null test. Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. Our algorithm, search, is used and defined as\n\nQ ← search(ε, {})\n\nsearch(t, S):\n  for each a ∈ A, o ∈ O\n    if aot is linearly independent of S\n      then S ← search(aot, S ∪ {aot})\n  return S\n\nThe algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the set of tests Q returned by search has linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations.
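The search procedure translates almost line for line into code. The sketch below (our illustration, not the authors' implementation) runs it on the float/reset POMDP. We use the equivalent formulation, permitted by the footnote on POMDP definitions, in which the observation is emitted from the state where the action is taken (the natural fit for float/reset), and a numerical rank computation stands in for Gaussian elimination:

```python
import numpy as np

# Float/reset POMDP, states 0-4 with state 4 the reset state.
n = 5
T = {'f': np.zeros((n, n)), 'r': np.zeros((n, n))}
for i in range(n):
    T['f'][i, max(i - 1, 0)] += 0.5
    T['f'][i, min(i + 1, n - 1)] += 0.5
    T['r'][i, 4] = 1.0

# Diagonal observation matrices O[a, o], with the observation emitted from
# the state in which the action is taken: observe 1 only on a reset action
# taken in the reset state.
O = {('f', 0): np.eye(n), ('f', 1): np.zeros((n, n)),
     ('r', 0): np.diag([1., 1., 1., 1., 0.]),
     ('r', 1): np.diag([0., 0., 0., 0., 1.])}

def outcome(t):
    """Outcome vector u(t): success probability of test t from each state."""
    v = np.ones(n)
    for a, o in reversed(t):   # innermost (last) action-observation first
        v = O[a, o] @ T[a] @ v
    return v

def independent(S, t):
    """Is u(t) linearly independent of the outcome vectors of tests in S?"""
    vecs = [outcome(s) for s in S] + [outcome(t)]
    return np.linalg.matrix_rank(np.array(vecs)) == len(vecs)

def search(t, S):
    for a in ('f', 'r'):
        for o in (0, 1):
            ext = ((a, o),) + t        # one-step extension aot
            if independent(S, ext):
                S = search(ext, S + [ext])
    return S

Q = search((), [])
print(len(Q), max(len(t) for t in Q))  # -> 5 5
```

This finds 5 tests, none longer than n = 5 action-observation pairs, matching the bound; the particular tests found differ in form from those listed below because of the different observation convention.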
Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs.\n\nThe check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time.\n\nBy construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. We now show that this is true for any test.\n\nLemma 1 The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test.\n\nProof: Let U be the (n x q) matrix formed by concatenating the outcome vectors for all tests in Q. Since, for all combinations of a and o, the columns of T^a O^{a,o} U are linearly dependent on the columns of U, we can write T^a O^{a,o} U = U W^T for some (q x q) matrix of weights W.\n\nIf t is a test that is linearly dependent on Q, then any one-step extension of t, aot, is linearly dependent on Q. This is because we can write the outcome vector for t as u(t) = (U w^T)^T for some (1 x q) weight vector w and the outcome vector for aot as u(aot) = (T^a O^{a,o} u(t)^T)^T = (T^a O^{a,o} U w^T)^T = (U W^T w^T)^T. Thus, aot is linearly dependent on Q.\n\nNow, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. □\n\nReturning to the float/reset example POMDP, search begins by enumerating the 4 extensions to the null test (f0, f1, r0, and r1). Of these, only f0 and r0 are linearly independent. Of the extensions of these, f0r0 is the only one that is linearly independent of the other two. The remaining two tests added to Q by search are f0f0r0 and f0f0f0r0.
No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts.\n\nWe now show that the set of tests Q constitutes a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP.\n\nFor each combination of a and o, define a (q x q) matrix M_{ao} = (U^+ T^a O^{a,o} U)^T and a (1 x q) vector m_{ao} = (U^+ T^a O^{a,o} e_n^T)^T, where U is the matrix of outcome vectors defined in the previous section and U^+ is its pseudoinverse.[2] The ith row of M_{ao} is m_{aot_i}. The probability distribution on histories implied by these projection vectors is\n\np(h) M_{a_1 o_1} ... M_{a_{t-1} o_{t-1}} m_{a_t o_t}^T\n= b(h) U U^+ T^{a_1} O^{a_1,o_1} U ... U^+ T^{a_{t-1}} O^{a_{t-1},o_{t-1}} U U^+ T^{a_t} O^{a_t,o_t} e_n^T\n= b(h) T^{a_1} O^{a_1,o_1} ... T^{a_{t-1}} O^{a_{t-1},o_{t-1}} T^{a_t} O^{a_t,o_t} e_n^T,\n\ni.e., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that U U^+ v^T = v^T for any v^T linearly dependent on the columns of U. This holds by construction of U in the previous section.\n\nThis completes the proof of Theorem 1.\n\nCompleting the float/reset example, consider the M_{f0} matrix found by the process defined in this section. It derives predictions for each test in Q after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for r0 is exactly the old prediction for f0r0, for example. The only nontrivial test is f0f0f0r0. Its outcome can be computed as 0.250 p(r0|h) - 0.0625 p(f0r0|h) + 0.750 p(f0f0r0|h). This example illustrates that the projection vectors need not contain only positive entries.\n\n3 Conclusion\n\nWe have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs.
In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing these potential advantages:\n\nLearnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting; it is an open question whether something similar can be done with a PSR. Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple-action case since outcomes are conditioned on behavior, violating some required independence assumptions.\n\nCompactness. We have shown that there exist linear PSRs no more complex than the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size of the state space, i.e., exponentially in the number of component POMDPs. This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP.\n\n[2] If U = AΣB^T is the singular value decomposition of U, then BΣ^+A^T is the pseudoinverse. The pseudoinverse of the diagonal matrix Σ replaces each non-zero element with its reciprocal.
We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments.\n\nGeneralization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998).\n\nPowerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments.\n\nAcknowledgments: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas.\n\nReferences\n\nBaum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171.\n\nBaxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198.\n\nChrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press.\n\nJaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398.\n\nKaelbling, L. P., Littman, M. L., & Cassandra, A. R.
(1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.\n\nLovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65.\n\nMcCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, Department of Computer Science, University of Rochester.\n\nRivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589.\n\nShatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) (pp. 920-929).\n\nSingh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems 10 (pp. 1057-1063).\n\nSutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181-211.\n\nThrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.", "award": [], "sourceid": 1983, "authors": [{"given_name": "Michael", "family_name": "Littman", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}]}