{"title": "A Nonlinear Predictive State Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 855, "page_last": 862, "abstract": "", "full_text": "A Nonlinear Predictive State Representation\n\nMatthew R. Rudary and Satinder Singh\n\nComputer Science and Engineering\n\nUniversity of Michigan\nAnn Arbor, MI 48109\n\n{mrudary,baveja}@umich.edu\n\nAbstract\n\nPredictive state representations (PSRs) use predictions of a set of tests to\nrepresent the state of controlled dynamical systems. One reason why this\nrepresentation is exciting as an alternative to partially observable Markov\ndecision processes (POMDPs) is that PSR models of dynamical systems\nmay be much more compact than POMDP models. Empirical work on\nPSRs to date has focused on linear PSRs, which have not allowed for\ncompression relative to POMDPs. We introduce a new notion of tests\nwhich allows us to de\ufb01ne a new type of PSR that is nonlinear in general\nand allows for exponential compression in some deterministic dynami-\ncal systems. These new tests, called e-tests, are related to the tests used\nby Rivest and Schapire [1] in their work with the diversity representation,\nbut our PSR avoids some of the pitfalls of their representation\u2014in partic-\nular, its potential to be exponentially larger than the equivalent POMDP.\n\n1 Introduction\n\nA predictive state representation, or PSR, captures the state of a controlled dynamical sys-\ntem not as a memory of past observations (as do history-window approaches), nor as a dis-\ntribution over hidden states (as do partially observable Markov decision process or POMDP\napproaches), but as predictions for a set of tests that can be done on the system. A test is\na sequence of action-observation pairs and the prediction for a test is the probability of\nthe test-observations happening if the test-actions are executed. Littman et al. 
[2] showed that PSRs are as flexible a representation as POMDPs and are a more powerful representation than fixed-length history-window approaches. PSRs are potentially significant for two main reasons: 1) they are expressed entirely in terms of observable quantities, which may allow the development of methods for learning PSR models from observation data that behave and scale better than existing methods for learning POMDP models from observation data, and 2) they may be much more compact than POMDP representations. It is the latter potential advantage that we focus on in this paper.

All PSRs studied to date have been linear, in the sense that the probability of any sequence of k observations given a sequence of k actions can be expressed as a linear function of the predictions of a core set of tests. We introduce a new type of test, the e-test, and present the first nonlinear PSR that can be applied to a general class of dynamical systems. In particular, in the first such result for PSRs, we show that there exist controlled dynamical systems whose PSR representations are exponentially smaller than their POMDP representations.

To arrive at this result, we briefly review PSRs, introduce e-tests and an algorithm to generate a core set of e-tests given a POMDP, show that a representation built using e-tests is a PSR and that it can be exponentially smaller than the equivalent POMDP, and conclude with example problems and a look at future work in this area.

2 Models of Dynamical Systems

A model of a controlled dynamical system defines a probability distribution over the sequences of observations one would get for any sequence of actions one could execute in the system. Equivalently, given any history of interaction with the dynamical system so far, a model defines the distribution over sequences of future observations for all sequences of future actions. 
The state of such a model must be a sufficient statistic of the observed history; that is, it must encode all the information conveyed by the history.

POMDPs [3, 4] and PSRs [2] both model controlled dynamical systems. In POMDPs, a belief state is used to encode historical information; in PSRs, probabilities of particular future outcomes are used. Here we describe both models and relate them to one another.

POMDPs A POMDP model is defined by a tuple ⟨S, A, O, T, O, b0⟩, where S, A, and O are, respectively, sets of (unobservable) hypothetical underlying-system states, actions that can be taken, and observations that may be issued by the system. T is a set of matrices of dimension |S| × |S|, one for each a ∈ A, such that T^a_{ij} is the probability that the next state is j given that the current state is i and action a is taken. O is a set of |S| × |S| diagonal matrices, one for each action-observation pair, such that O^{a,o}_{ii} is the probability of observing o after arriving in state i by taking action a. Finally, b0 is the initial belief state, a |S| × 1 vector whose ith element is the probability of the system starting in state i. The belief state at history h is b(S|h) = [prob(1|h) prob(2|h) ... prob(|S||h)], where prob(i|h) is the probability of the unobserved state being i at history h. After taking action a in history h and observing o, the belief state can be updated as follows:

b^T(S|hao) = b^T(S|h) T^a O^{a,o} / (b^T(S|h) T^a O^{a,o} 1_{|S|})    (1_{|S|} is the |S| × 1 vector consisting of all 1's)

PSRs Littman et al. [2] (inspired by the work of Rivest and Schapire [1] and Jaeger [5]) introduced PSRs to represent the state of a controlled dynamical system using predictions of the outcomes of tests. They define a test t as a sequence of actions and observations t = a1 o1 a2 o2 ··· ak ok; we shall call this type of test a sequence test, or s-test in short. 
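As a concrete illustration, the belief-state update above can be sketched in a few lines of plain Python (the two-state POMDP below uses invented numbers, purely for illustration):

```python
# Minimal sketch of the POMDP belief update
#   b^T(S|hao) = b^T(S|h) T^a O^{a,o} / (b^T(S|h) T^a O^{a,o} 1),
# using plain lists; the 2-state numbers below are invented for illustration.

def belief_update(b, T_a, O_ao_diag):
    """b: belief over states; T_a: |S| x |S| transition matrix for action a;
    O_ao_diag: diagonal of O^{a,o} (prob. of o after reaching each state via a)."""
    n = len(b)
    unnorm = [sum(b[i] * T_a[i][j] for i in range(n)) * O_ao_diag[j]
              for j in range(n)]                       # (b^T T^a O^{a,o})_j
    z = sum(unnorm)                                    # prob(o | h, a)
    return [x / z for x in unnorm]

b0 = [0.5, 0.5]                    # uniform initial belief
T_a = [[0.9, 0.1], [0.2, 0.8]]     # transition matrix for some action a
O_ao = [0.7, 0.1]                  # o is far likelier in state 0
b1 = belief_update(b0, T_a, O_ao)  # belief shifts toward state 0
```

The normalizer z is exactly the one-step observation probability prob(o | h, a), which is why the same product reappears in the denominator of the update.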
An s-test succeeds iff, when the sequence of actions a1 a2 ··· ak is executed, the system issues the observation sequence o1 o2 ··· ok. The prediction p(t|h) is the probability that the s-test t succeeds from observed history h (of length n w.l.o.g.); that is,

p(t|h) = prob(o_{n+1} = o1, ..., o_{n+k} = ok | h, a_{n+1} = a1, ..., a_{n+k} = ak)    (1)

where a_i and o_i denote the action taken and the observation, respectively, at time i. In the rest of this paper, we will abbreviate expressions like the right-hand side of Equation 1 by prob(o1 o2 ··· ok | h a1 a2 ··· ak).

A set of s-tests Q = {q1, q2, ..., q_{|Q|}} is said to be a core set if it constitutes a PSR, i.e., if its vector of predictions, p(Q|h) = [p(q1|h) p(q2|h) ... p(q_{|Q|}|h)], is a sufficient statistic for any history h. Equivalently, if Q is a core set, then for any s-test t, there exists a function f_t such that p(t|h) = f_t(p(Q|h)) for all h. The prediction vector p(Q|h) in PSR models corresponds to the belief state b(S|h) in POMDP models. The PSRs discussed by Littman et al. [2] are linear PSRs in the sense that for any s-test t, f_t is a linear function of the predictions of the core s-tests; equivalently, the following equation

∀ s-tests t, ∃ a weight vector w_t s.t. p(t|h) = p^T(Q|h) w_t    (2)

defines what it means for Q to constitute a linear PSR. Upon taking action a in history h and observing o, the prediction vector can be updated as follows:

p(q_i|hao) = p(a o q_i|h) / p(a o|h) = f_{aoq_i}(p(Q|h)) / f_{ao}(p(Q|h)) = p^T(Q|h) m_{aoq_i} / (p^T(Q|h) m_{ao})    (3)

where the final right-hand side is only valid for linear PSRs. Thus a linear PSR model is specified by Q and by the weight vectors m_{aoq} in the above equation, for all a ∈ A, o ∈ O, q ∈ Q ∪ {φ} (where φ is the null sequence). 
It is pertinent to ask what sort of dynamical systems can be modeled by a PSR and how many core tests are required to model a system. In fact, Littman et al. [2] answered these questions with the following result:

Lemma 1 (Littman et al. [2]) For any dynamical system that can be represented by a finite POMDP model, there exists a linear PSR model of size (|Q|) no more than the size (|S|) of the POMDP model.

Littman et al. prove this result by providing an algorithm for constructing a linear PSR model from a POMDP model. The algorithm they present depends on the insight that s-tests are differentiated by their outcome vectors. An outcome vector u(t) for an s-test t = a1 o1 a2 o2 ... ak ok is a |S| × 1 vector; the ith component of the vector is the probability of t succeeding given that the system is in the hidden state i, i.e., u(t) = T^{a1} O^{a1,o1} T^{a2} O^{a2,o2} ··· T^{ak} O^{ak,ok} 1_{|S|}. Consider the matrix U whose rows correspond to the states in S and whose columns are the outcome vectors for all possible s-tests. Let Q denote the set of s-tests associated with a maximal set of linearly independent columns of U; clearly |Q| ≤ |S|. Littman et al. showed that Q is a core set for a linear PSR model by the following logic. Let U(Q) denote the submatrix consisting of the columns of U corresponding to the s-tests in Q. Clearly, for any s-test t, u(t) = U(Q) w_t for some vector of weights w_t. Therefore, p(t|h) = b^T(S|h) u(t) = b^T(S|h) U(Q) w_t = p^T(Q|h) w_t, which is exactly the requirement for a linear PSR (cf. Equation 2).

We will reuse the concept of linear independence of outcome vectors with a new type of test to derive a PSR that is nonlinear in general. This is the first nonlinear PSR that can be used to represent a general class of problems. 
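The outcome-vector construction can be made concrete with a small sketch (toy numbers of my own, not from the paper): compute u(t) right to left from the product above and check that p(t|h) = b^T(S|h) u(t) agrees with brute-force marginalization over hidden-state paths:

```python
# Sketch (invented toy POMDP): the outcome vector of an s-test
# t = a1 o1 a2 o2 is u(t) = T^{a1} O^{a1,o1} T^{a2} O^{a2,o2} 1, and
# p(t|h) = b(S|h)^T u(t). We verify this against explicit marginalization.

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def outcome_vector(test, T, O, n):
    """test = [(a1, o1), (a2, o2), ...]; builds u(t) right to left."""
    u = [1.0] * n                                    # the all-ones vector
    for a, o in reversed(test):
        u = [O[(a, o)][j] * u[j] for j in range(n)]  # diagonal O^{a,o}
        u = mat_vec(T[a], u)                         # then T^a
    return u

# A 2-state, 1-action, 2-observation POMDP.
T = {'a': [[0.7, 0.3], [0.4, 0.6]]}
O = {('a', 'x'): [0.9, 0.2], ('a', 'y'): [0.1, 0.8]}
b = [0.6, 0.4]
test = [('a', 'x'), ('a', 'y')]

u = outcome_vector(test, T, O, 2)
pred = sum(b[i] * u[i] for i in range(2))

# Brute force: sum over hidden state paths i -> j -> k.
brute = sum(b[i] * T['a'][i][j] * O[('a', 'x')][j] *
            T['a'][j][k] * O[('a', 'y')][k]
            for i in range(2) for j in range(2) for k in range(2))
assert abs(pred - brute) < 1e-12
```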
In addition, this type of PSR in some cases requires a number of core tests that is exponentially smaller than the number of states in the minimal POMDP or the number of core tests in the linear PSR.

3 A new notion of tests

In order to formulate a PSR that requires fewer core tests, we look to a new kind of test—the end test, or e-test in short. An e-test is defined by a sequence of actions and a single ending observation. An e-test e = a1 a2 ··· ak ok succeeds if, after the sequence of actions a1 a2 ··· ak is executed, ok is observed. This type of test is inspired by Rivest and Schapire's [1] notion of tests in their work on modeling deterministic dynamical systems.

3.1 PSRs with e-tests

Just as Littman et al. considered the problem of constructing s-test-based PSRs from POMDP models, here we consider how to construct an e-test-based PSR, or EPSR, from a POMDP model, and we will derive properties of EPSRs from the resulting construction. The |S| × 1 outcome vector for an e-test e = a1 a2 ... ak ok is

v(e) = T^{a1} T^{a2} ··· T^{ak} O^{ak,ok} 1_{|S|}.    (4)

Note that we are using v's to denote outcome vectors for e-tests and u's to denote outcome vectors for s-tests. Consider the matrix V whose rows correspond to the states in S and whose columns are the outcome vectors for all possible e-tests. Let QV denote the set of e-tests associated with a maximal set of linearly independent columns of matrix V; clearly |QV| ≤ |S|. Note that QV is not uniquely defined; there are many such sets.

done ← false; i ← 0; L ← {}
do until done
    done ← true
    N ← generate all one-action extensions of length-i tests in L
    for each t ∈ N
        if v(t) is linearly independent of V(L) then
            L ← L ∪ {t}; done ← false
    end for
    i ← i + 1
end do
QV ← L

Figure 1: Our search algorithm to find a set of core e-tests given the outcome vectors.

The hope is that the set QV is a core set for an EPSR model of the dynamical system represented by the POMDP model. But before we consider this hope, let us consider how we would find QV given a POMDP model.

We can compute the outcome vector for any e-test from the POMDP parameters using Equation 4. So we could compute the columns of V one by one and check how many linearly independent columns we find. If we ever find |S| linearly independent columns, we know we can stop, because we will not find any more. However, if |QV| < |S|, then how would we know when to stop? In Figure 1, we present a search algorithm that finds a set QV in polynomial time. Our algorithm is adapted from Littman et al.'s algorithm for finding core s-tests. The algorithm starts with all e-tests of length one and maintains a set L of currently known linearly independent e-tests. At each iteration it searches for new linearly independent e-tests among all one-action extensions of the e-tests it added to L at the last iteration, and it stops when an iteration does not add to the set L.

Lemma 2 The search algorithm of Figure 1 computes the set QV in time polynomial in the size of the POMDP.

Proof Computing the outcome vector for an e-test using Equation 4 is polynomial in the number of states and the length of the e-test. There cannot be more than |S| e-tests in the set L maintained by the search algorithm, and only the initial one-action e-tests and one-action extensions of the e-tests in L are ever considered. Each extension length considered must add an e-test, else the algorithm stops, and so the maximal length of the e-tests in QV is upper bounded by the number of states. Therefore, our algorithm returns QV in polynomial time. □

Note that this algorithm is only practical if the outcome vectors have been found; this only makes sense if the POMDP model is already known, as outcome vectors map POMDP states to outcomes. 
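The search of Figure 1 can be sketched for the deterministic case (my own illustration, not the authors' code; the independence check uses Gram-Schmidt, and the demo system is a 3-bit rotate register of the kind analyzed later in Section 4.3):

```python
# Sketch of the Figure 1 search (my own illustration, not the authors' code).
# Linear independence of outcome vectors is checked via Gram-Schmidt; the
# demo POMDP is a deterministic 3-bit rotate register (cf. Section 4.3),
# so the expected number of core e-tests is k + 1 = 4.

def add_if_independent(basis, v, tol=1e-9):
    """Reduce v against an orthogonalized basis; if a nonzero residual is
    left, store it and report that v was linearly independent."""
    w = list(v)
    for b in basis:
        c = sum(x * y for x, y in zip(w, b)) / sum(x * x for x in b)
        w = [x - c * y for x, y in zip(w, b)]
    if sum(x * x for x in w) > tol:
        basis.append(w)
        return True
    return False

def find_core_etests(states, actions, observations, step, observe):
    """Figure 1 for a deterministic POMDP given as step(s, a) -> s' and
    observe(s) -> o: grow one-action extensions until no new e-test has a
    linearly independent outcome vector."""
    def outcome(acts, o):  # v(e) for the e-test e = acts followed by o
        v = []
        for s in states:
            for a in acts:
                s = step(s, a)
            v.append(1.0 if observe(s) == o else 0.0)
        return v

    basis, core = [], []
    frontier = [((a,), o) for a in actions for o in observations]
    while frontier:
        added = [t for t in frontier if add_if_independent(basis, outcome(*t))]
        core += added
        # next frontier: one-action extensions of the tests just added
        frontier = [((a,) + acts, o) for acts, o in added for a in actions]
    return core

# 3-bit rotate register: observe the leftmost bit; rotate right/left, flip.
states = [(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
def step(s, a):
    if a == 'R': return (s[2], s[0], s[1])   # rotate right with wraparound
    if a == 'L': return (s[1], s[2], s[0])   # rotate left
    return (1 - s[0], s[1], s[2])            # 'F': flip the leftmost bit
observe = lambda s: 'O1' if s[0] else 'O0'

core = find_core_etests(states, ['R', 'L', 'F'], ['O0', 'O1'], step, observe)
```

For this register the search stops after the length-one tests, returning k + 1 = 4 core e-tests even though the system has 2^k = 8 states.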
We will address learning these models from observations in future work [6]. Next we show that the prediction of any e-test can be computed linearly from the prediction vector for the e-tests in QV.

Lemma 3 For any history h and any e-test e, the prediction p(e|h) is a linear function of the prediction vector p(QV|h), i.e., p(e|h) = p^T(QV|h) w_e for some weight vector w_e.

Proof Let V(QV) be the submatrix of V containing the columns corresponding to QV. By the definition of QV, for any e-test e, v(e) = V(QV) w_e for some weight vector w_e. Furthermore, for any history h, p(e|h) = b^T(S|h) v(e) = b^T(S|h) V(QV) w_e = p^T(QV|h) w_e. □

Note that Lemma 3 does not imply that QV constitutes a PSR, let alone a linear PSR, for the definition of a PSR requires that the predictions of all s-tests be computable from the core test predictions. Next we turn to the crucial question: does QV constitute a PSR?

Theorem 1 If V(QV), defined as above with respect to some POMDP model of a dynamical system, is a square matrix, i.e., if the number of e-tests in QV equals the number of states |S| (in that POMDP model), then QV constitutes a linear EPSR for that dynamical system.

Proof For any history h, p^T(QV|h) = b^T(S|h) V(QV). If V(QV) is square then it is invertible, because by construction it has full rank, and hence for any history h, b^T(S|h) = p^T(QV|h) V^{-1}(QV). For any s-test t = a1 o1 a2 o2 ··· ak ok,

p(t|h) = b^T(S|h) T^{a1} O^{a1,o1} T^{a2} O^{a2,o2} ··· T^{ak} O^{ak,ok} 1_{|S|}  (by the first-principles definition)
       = p^T(QV|h) V^{-1}(QV) T^{a1} O^{a1,o1} T^{a2} O^{a2,o2} ··· T^{ak} O^{ak,ok} 1_{|S|} = p^T(QV|h) w_t

for some weight vector w_t. Thus, QV constitutes a linear EPSR as per the definition in Equation 2. □

We note that the product T^{a1} O^{a1,o1} ··· T^{ak} O^{ak,ok} 1_{|S|} appears often in association with an s-test t = a1 o1 ··· ak ok, and abbreviate this product z(t). We similarly define z(e) = T^{a1} T^{a2} ··· T^{ak} O^{ak,ok} 1_{|S|} for the e-test e = a1 a2 ··· ak ok.

Staying with the linear EPSR case for now, we can define an update function for p(QV|h) as follows (remembering that V(QV) is invertible in this case):

p(e_i|hao) = p(a o e_i|h) / p(a o|h) = b^T(S|h) T^a O^{a,o} z(e_i) / (p^T(QV|h) m_{ao}) = p^T(QV|h) V^{-1}(QV) z(a o e_i) / (p^T(QV|h) m_{ao}) = p^T(QV|h) m_{aoe_i} / (p^T(QV|h) m_{ao})    (5)

where z(a o e_i) = T^a O^{a,o} z(e_i), and where we used the fact that the test ao in the denominator is an e-test. (The form of the linear EPSR update equation is identical to the form of the update for linear PSRs with s-tests given in Equation 3.) Thus, a linear EPSR model is defined by QV and the set of weight vectors m_{aoe} for all a ∈ A, o ∈ O, e ∈ QV ∪ {φ}, used in Equation 5.

Now let us turn to the case when the number of e-tests in QV is less than |S|, i.e., when V(QV) is not a square matrix.

Lemma 4 In general, if the number of e-tests in QV is less than |S|, then QV is not guaranteed to be a linear EPSR.

Proof (Sketch) To prove this lemma, we need only find an example of a dynamical system that is an EPSR but not a linear EPSR. In Section 4.3 we present a k-bit rotate register as an example of such a problem. We show in that section that the state space size is 2^k and the number of s-tests in the core set of a linear s-test-based PSR model is also 2^k, but the number of e-tests in QV is only k + 1. Note that this means that the rank of the U matrix for s-tests is 2^k while the rank of the V matrix for e-tests is k + 1. 
This must mean that QV is not a linear EPSR. We skip these details for lack of space. □

Lemma 4 leaves open the possibility that if |QV| < |S| then QV constitutes a nonlinear EPSR. We were unable, thus far, to evaluate this possibility for general POMDPs, but we did obtain an interesting and positive answer, presented in the next section, for the class of deterministic POMDPs.

4 A Nonlinear PSR for Deterministic Dynamical Systems

In deterministic dynamical systems, the predictions of both e-tests and s-tests are binary, and it is this basic fact that allows us to prove the following result.

Theorem 2 For deterministic dynamical systems the set of e-tests QV is always an EPSR, and in general it is a nonlinear EPSR.

Proof For an arbitrary s-test t = a1 o1 a2 o2 ··· ak ok, some arbitrary history h that is realizable (i.e., p(h) = 1), and some vectors w_{a1o1}, w_{a1o1a2o2}, ..., w_{a1o1a2o2···akok} of length |QV|, we have

prob(o1 o2 ··· ok | h a1 a2 ··· ak)
  = prob(o1|h a1) prob(o2|h a1 o1 a2) ··· prob(ok|h a1 o1 a2 o2 ··· a_{k-1} o_{k-1} ak)
  = prob(o1|h a1) prob(o2|h a1 a2) ··· prob(ok|h a1 a2 ··· ak)
  = (p^T(QV|h) w_{a1o1}) (p^T(QV|h) w_{a1o1a2o2}) ··· (p^T(QV|h) w_{a1o1···akok})
  = f_t(p(QV|h))

In going from the second line to the third, we eliminate observations from the conditions by noting that in a deterministic system, the observation emitted depends only on the sequence of actions executed. In going from the third line to the fourth, we use the result of Lemma 3 that, regardless of the size of QV, the predictions of all e-tests at any history h are linear functions of p(QV|h). 
This shows that for deterministic dynamical systems, QV, regardless of its size, constitutes an EPSR. □

Update Function: Since predictions are binary in deterministic EPSRs, p(ao|h) must be 1 if o is observed after taking action a in history h:

p(e_i|hao) = p(a o e_i|h) / p(a o|h) = p(a e_i|h) = p^T(QV|h) m_{ae_i}

where the second equality from the left comes about because p(ao|h) = 1 and, because o must be observed when a is executed, p(aoe_i|h) = p(ae_i|h); the last equality uses the fact that a e_i is just another e-test and so, by Lemma 3, its prediction must be a linear function of p(QV|h). It is rather interesting that even though the EPSR formed through QV is nonlinear in general (as seen in Theorem 2), the update function is in fact linear.

4.1 Diversity and e-tests

Rivest and Schapire's [1] diversity representation, the inspiration for e-tests, applies only to deterministic systems and can be explained using the binary outcome matrix V defined at the beginning of Section 3.1. Diversity also uses the predictions of a set of e-tests as its representation of state; it uses as many e-tests as there are distinct columns in the matrix V. Clearly, there can be at most 2^{|S|} distinct columns, and they show that there must be at least log2(|S|) distinct columns and that these bounds are tight. Thus the size of the diversity representation can be exponentially smaller or exponentially bigger than the size of a POMDP representation. 
While we use the same notion of tests as the diversity representation, our use of linear independence of outcome vectors, instead of equivalence classes based on equality of outcome vectors, allows us to use e-tests on stochastic systems.

Next we show through an example that EPSR models of deterministic dynamical systems can lead to exponential compression over POMDP models in some cases while avoiding the exponential blowup possible in Rivest and Schapire's [1] diversity representation.

4.2 EPSRs can be Exponentially Smaller than Diversity

This first example shows a case in which the size of the EPSR representation is exponentially smaller than the size of the diversity representation. The hit register (see Figure 2a) is a k-bit register (these are the value bits) with an additional special hit bit. There are 2^k + 1 states in the POMDP describing this system—one state in which the hit bit is 1 and 2^k states in which the hit bit is 0 and the value bits take on different combinations of values.

Figure 2: The two example systems. a) The k-bit hit register. There are k value bits and the special hit bit. The value of the hit bit determines the observation, and k + 2 actions alter the values of the bits; this is fully specified in Section 4.2. b) The k-bit rotate register. The value of the leftmost bit is observed; this bit can be flipped, and the register can be rotated either to the right or to the left. This is described in greater detail in Section 4.3.

There are k + 2 actions: a flip action Fi for each value bit i that inverts bit i if the hit bit is not set, a set action Sh that sets the hit bit if all the value bits are 0, and a clear action Ch that clears the hit bit. There are two observations: Oh if the hit bit is set and Om otherwise. Rivest and Schapire [1] present a similar problem (their version has no Ch action). 
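This example can be checked numerically. The sketch below (my own code, not from the paper) enumerates short e-tests on the 2-bit hit register and counts linearly independent outcome vectors; one detail the text leaves open is what Ch does to the value bits when it clears the hit bit, so the code assumes they reset to all zeros (the count is unaffected by this choice):

```python
# Sketch (my own check): count linearly independent e-test outcome vectors
# for the 2-bit hit register. Assumption not fixed by the text: clearing
# the hit bit (Ch from the hit state) restores the all-zeros value state.
from itertools import product

STATES = ['H', (0, 0), (0, 1), (1, 0), (1, 1)]   # hit state + 2^k value states

def step(s, a):
    if s == 'H':
        return (0, 0) if a == 'Ch' else 'H'      # F1, F2, Sh are no-ops on H
    if a == 'F1': return (1 - s[0], s[1])
    if a == 'F2': return (s[0], 1 - s[1])
    if a == 'Sh': return 'H' if s == (0, 0) else s
    return s                                      # 'Ch' with the hit bit clear

def outcome(actions, o):
    """v(e)_i = 1 iff running `actions` from state i ends with observation o."""
    v = []
    for s in STATES:
        for a in actions:
            s = step(s, a)
        v.append(1.0 if ('Oh' if s == 'H' else 'Om') == o else 0.0)
    return v

def add_if_independent(basis, v, tol=1e-9):
    w = list(v)
    for b in basis:
        c = sum(x * y for x, y in zip(w, b)) / sum(x * x for x in b)
        w = [x - c * y for x, y in zip(w, b)]
    if sum(x * x for x in w) > tol:
        basis.append(w)
        return True
    return False

basis = []
for n in (1, 2, 3):                              # e-tests of up to 3 actions
    for acts in product(('F1', 'F2', 'Sh', 'Ch'), repeat=n):
        for o in ('Oh', 'Om'):
            add_if_independent(basis, outcome(acts, o))
rank = len(basis)                                # 2k + 1 = 5 for k = 2
```

For k = 2 the counts 2k + 1 and 2^k + 1 coincide at 5, so the gap over diversity only opens up for larger k, but the rank computation is the same.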
The diversity representation requires O(2^{2^k}) equivalence classes and thus tests, whereas an EPSR has only 2k + 1 core e-tests (see Table 1 for the core e-tests and update function when k = 2).

Table 1: Core e-tests and update functions for the 2-bit hit register problem.

test     | F1          | F2          | Sh                              | Ch
F1Oh     | p(F1Oh)     | p(F1Oh)     | p(ShOh)                         | 0
ShOh     | p(F1ShOh)   | p(F2ShOh)   | p(ShOh)                         | p(ShOh)
F1ShOh   | p(ShOh)     | p(F2F1ShOh) | p(ShOh) - p(F1Oh) + p(F1ShOh)   | p(F1ShOh) - p(F1Oh)
F2ShOh   | p(F2F1ShOh) | p(ShOh)     | p(ShOh) - p(F1Oh) + p(F2ShOh)   | p(F2ShOh) - p(F1Oh)
F2F1ShOh | p(F2ShOh)   | p(F1ShOh)   | p(ShOh) - p(F1Oh) + p(F2F1ShOh) | p(F2F1ShOh) - p(F1Oh)

Lemma 5 For deterministic dynamical systems, the size of the EPSR representation is always upper-bounded by the minimum of the size of the diversity representation and the size of the POMDP representation.

Proof The size of the EPSR representation, |QV|, is upper-bounded by |S| by construction of QV. The number of e-tests used by the diversity representation is the number of distinct columns in the binary V matrix of Section 3.1, while the number of e-tests used by the EPSR representation is the number of linearly independent columns in V. Clearly the latter is upper-bounded by the former. As a quick example, consider the case of 2-bit binary vectors: there are 4 distinct vectors but only 2 linearly independent ones. □

4.3 EPSRs can be Exponentially Smaller than POMDPs and the Original PSRs

This second example shows a case in which the EPSR representation uses exponentially fewer tests than the number of states in the POMDP representation and the number of core tests in the original linear PSR representation. 
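The claim can be verified mechanically. The sketch below (my own code, not from the paper) simulates the 4-bit rotate register described in this section and checks that its five core e-test predictions, advanced with the linear update rules of Table 2, track the true state exactly:

```python
# Sketch (my own code): track the five core e-test predictions of the 4-bit
# rotate register with the linear update rules of Table 2, and verify them
# against direct simulation of the register on a random action sequence.
import random

def predictions(s):
    """Ground-truth predictions of the core e-tests as functions of state."""
    b0, b1, b2, b3 = s
    return {'FO1': 1 - b0, 'RO1': b3, 'LO1': b1, 'FFO1': b0, 'RRO1': b2}

def step(s, a):
    if a == 'R': return (s[3], s[0], s[1], s[2])   # rotate right
    if a == 'L': return (s[1], s[2], s[3], s[0])   # rotate left
    return (1 - s[0], s[1], s[2], s[3])            # 'F': flip leftmost bit

def update(p, a):
    """Table 2: every new prediction is linear in the old predictions."""
    if a == 'R':
        return {'FO1': p['FO1'] + p['FFO1'] - p['RO1'], 'RO1': p['RRO1'],
                'LO1': p['FFO1'], 'FFO1': p['RO1'], 'RRO1': p['LO1']}
    if a == 'L':
        return {'FO1': p['FO1'] + p['FFO1'] - p['LO1'], 'RO1': p['FFO1'],
                'LO1': p['RRO1'], 'FFO1': p['LO1'], 'RRO1': p['RO1']}
    return {'FO1': p['FFO1'], 'RO1': p['RO1'], 'LO1': p['LO1'],
            'FFO1': p['FO1'], 'RRO1': p['RRO1']}

random.seed(0)
s = (0, 1, 1, 0)
p = predictions(s)
for _ in range(200):
    a = random.choice('RLF')
    s, p = step(s, a), update(p, a)
    assert p == predictions(s)   # the EPSR state tracks the system exactly
```

Note how the identity p(FO1) + p(FFO1) = 1 stands in for the constant term, so every update is a pure linear combination of the core predictions.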
The rotate register illustrated in Figure 2b is a k-bit shift register. There are two observations: O1 is observed if the leftmost bit is 1 and O0 is observed when the leftmost bit is 0. The three actions are R, which shifts the register one place to the right with wraparound, L, which does the opposite, and F, which flips the leftmost bit. This problem is also presented by Rivest and Schapire as an example of a system whose diversity is exponentially smaller than the number of states in the minimal POMDP, which is 2^k. This is also the number of core s-tests in the equivalent linear PSR (we computed these 2^k s-tests but do not report them here). The diversity is 2k. However, the EPSR that models this system has only k + 1 core e-tests. The tests and update function for the 4-bit rotate register are shown in Table 2.

Table 2: Core e-tests and update function for the 4-bit rotate register problem.

test | R                         | L                         | F
FO1  | p(FO1) + p(FFO1) - p(RO1) | p(FO1) + p(FFO1) - p(LO1) | p(FFO1)
RO1  | p(RRO1)                   | p(FFO1)                   | p(RO1)
LO1  | p(FFO1)                   | p(RRO1)                   | p(LO1)
FFO1 | p(RO1)                    | p(LO1)                    | p(FO1)
RRO1 | p(LO1)                    | p(RO1)                    | p(RRO1)

5 Conclusions and Future Work

In this paper we have used a new type of test, the e-test, to specify a nonlinear PSR for deterministic controlled dynamical systems. This is the first nonlinear PSR for any general class of systems. We proved that in some deterministic systems our new PSR models are exponentially smaller than both the original PSR models and POMDP models. Similarly, compared to the size of Rivest and Schapire's diversity representation (the inspiration for the notion of e-tests), we proved that our PSR models are never bigger but are sometimes exponentially smaller. 
This work has primarily been an attempt to understand the representational implications of using e-tests; as future work, we will explore the computational implications of switching to e-tests.

Acknowledgments

Matt Rudary and Satinder Singh were supported by a grant from the Intel Research Council.

References

[1] Ronald L. Rivest and Robert E. Schapire. Diversity-based inference of finite automata. Journal of the ACM, 41(3):555–589, May 1994.

[2] Michael L. Littman, Richard S. Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems 14, 2001.

[3] William S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1):47–65, 1991.

[4] Michael L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.

[5] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.

[6] Satinder Singh, Michael L. Littman, Nicholas E. Jong, David Pardoe, and Peter Stone. Learning predictive state representations. In The Twentieth International Conference on Machine Learning (ICML-2003), 2003. To appear.
", "award": [], "sourceid": 2413, "authors": [{"given_name": "Matthew", "family_name": "Rudary", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}