{"title": "Completing State Representations using Spectral Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4328, "page_last": 4337, "abstract": "A central problem in dynamical system modeling is state discovery\u2014that is, finding a compact summary of the past that captures the information needed to predict the future. Predictive State Representations (PSRs) enable clever spectral methods for state discovery; however, while consistent in the limit of infinite data, these methods often suffer from poor performance in the low data regime. In this paper we develop a novel algorithm for incorporating domain knowledge, in the form of an imperfect state representation, as side information to speed spectral learning for PSRs. We prove theoretical results characterizing the relevance of a user-provided state representation, and design spectral algorithms that can take advantage of a relevant representation. Our algorithm utilizes principal angles to extract the relevant components of the representation, and is robust to misspecification. Empirical evaluation on synthetic HMMs, an aircraft identification domain, and a gene splice dataset shows that, even with weak domain knowledge, the algorithm can significantly outperform standard PSR learning.", "full_text": "Completing State Representations\n\nusing Spectral Learning\n\nNan Jiang\n\nUIUC\n\nUrbana, IL\n\nnanjiang@illinois.edu\n\nAlex Kulesza\nGoogle Research\nNew York, NY\n\nkulesza@google.com\n\nAbstract\n\nSatinder Singh\n\nUniversity of Michigan\n\nAnn Arbor, MI\n\nbaveja@umich.edu\n\nA central problem in dynamical system modeling is state discovery\u2014that is, \ufb01nding\na compact summary of the past that captures the information needed to predict the\nfuture. 
Predictive State Representations (PSRs) enable clever spectral methods\nfor state discovery; however, while consistent in the limit of in\ufb01nite data, these\nmethods often suffer from poor performance in the low data regime.\nIn this\npaper we develop a novel algorithm for incorporating domain knowledge, in the\nform of an imperfect state representation, as side information to speed spectral\nlearning for PSRs. We prove theoretical results characterizing the relevance of a\nuser-provided state representation, and design spectral algorithms that can take\nadvantage of a relevant representation. Our algorithm utilizes principal angles\nto extract the relevant components of the representation, and is robust to mis-\nspeci\ufb01cation. Empirical evaluation on synthetic HMMs, an aircraft identi\ufb01cation\ndomain, and a gene splice dataset shows that, even with weak domain knowledge,\nthe algorithm can signi\ufb01cantly outperform standard PSR learning.\n\n1\n\nIntroduction\n\nWhen modeling discrete-time, \ufb01nite-observation dynamical systems from data, a central challenge is\nstate representation discovery, that is, \ufb01nding a compact function of the history that forms a suf\ufb01cient\nstatistic for the future. Many models and algorithms have been developed for this problem, each\nof which represents and discovers state differently. For example, Hidden Markov Models (HMMs)\nrepresent state as the posterior over latent variables, and are learned via Expectation Maximization.\nRecurrent Neural Networks (RNNs) do not commit to any pre-determined semantics of state, but\nlearn a state update function by back-propagation through time. 
Here, we focus on Predictive State\nRepresentations (PSRs), which represent state as predictions of observable future events [1, 2], and\nare unique in that they can be learned by fast, closed-form, and consistent spectral algorithms [3, 4].\nThough they have been used successfully, spectral algorithms for PSRs attempt to discover the entire\nstate representation from raw data, ignoring the possibility that the user has domain knowledge about\nwhat might constitute a good state representation. In many application scenarios, however, users do\nhave such knowledge and can handcraft a meaningful, albeit incomplete state: for example, in many\ndomains found in reinforcement learning [5, 6, 7], the last observation is often highly informative,\nand only a small amount of additional information needs to be extracted from the history to form a\ncomplete state. While spectral algorithms for PSRs are asymptotically consistent, ignoring domain\nknowledge and discovering state from scratch is wasteful and can result in poor sample ef\ufb01ciency.\nIn this work, we extend PSRs to take advantage of an imperfect, user-provided state function f, and\ndesign spectral algorithms for learning the resulting PSR-f models. We theoretically characterize\nthe relevance of f to the system of interest, and show that a PSR-f model can have substantially\nsmaller size\u2014and can thus be learned from less data\u2014than the corresponding PSR. Our algorithm\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fcomputes principal angles to discover relevant components of f, and hence is robust to mis-speci\ufb01ed\nrepresentations. 
Experimental results show that this theoretical advantage translates to significantly improved performance in practice, particularly when only a limited amount of data is available.

2 Background

Consider a dynamical system M that produces sequences of observations from a finite set O starting from some fixed initial condition. (The initial condition can be defined by a system restart or, if the system is not subject to restarts, it can be the stationary distribution.) For any sequence of observations x ∈ O*, let P(x) be the probability that the first |x| observations, starting from the initial condition, are given by x. Similarly, for any pair of sequences h, t ∈ O*, let P(t|h) = P(ht)/P(h), where ht denotes the concatenation of h and t, be the probability that the next |t| observations are given by t conditioned on the fact that the first |h| observations were given by h.

We say that b : O* → Z is state for M if b is a sufficient statistic of history; that is, if all future observations are independent of past observations h ∈ O* conditioned on b(h). When the function b(·) is known, the system can be fully specified by P(o|h) = P(o|b(h)). And when Z is finite, the probabilities P(o|z) for o ∈ O and z ∈ Z can be estimated straightforwardly from data.

In this paper we consider slightly more general state representations b : O* → R^n, letting P(o|h) = m_o^T b(h) for some m_o ∈ R^n. This generalizes the discrete-valued b(·) above because we can lift a discrete state to a one-hot indicator vector in R^n with n = |Z|, in which case the z-th entry of m_o is P(o|z).

PSRs   When b(·) is unknown, we need to learn both the state representation and {m_o}_{o∈O} from data. PSRs prescribe the state semantics:

    b(h) = P_{T|h} := [P(t|h)]_{t∈T} ∈ R^{|T|},    (1)

where T ⊂ O* is a set of appropriately chosen tests. 
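To make these quantities concrete, here is a minimal sketch that computes P(x), P(t|h), and the predictive state of Eq.(1) for a hypothetical toy 2-state HMM. The transition/observation matrices below are illustrative inventions, not taken from the paper:

```python
import numpy as np

# Hypothetical toy HMM: 2 latent states, 2 observations.
# T_mat[i, j] = P(next state j | state i); O_mat[i, o] = P(obs o | state i).
T_mat = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
O_mat = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
init = np.array([0.5, 0.5])  # fixed initial condition over latent states

def seq_prob(x, belief=init):
    """P(x): probability that the first |x| observations are given by x."""
    b, p = belief.copy(), 1.0
    for o in x:
        step = b * O_mat[:, o]           # joint P(state, next obs = o)
        s = step.sum()                   # P(next obs = o | past)
        p *= s
        b = (step / s) @ T_mat           # condition on o, then transition
    return p

def cond_prob(t, h):
    """P(t | h) = P(ht) / P(h)."""
    return seq_prob(list(h) + list(t)) / seq_prob(h)

# Predictive state b(h) = P_{T|h} for the test set T = {0, 1} (length-1 tests).
state = [cond_prob([o], [1, 1]) for o in range(2)]
```

Since T here contains every length-1 test, the entries of `state` sum to one; longer tests in T would make b(h) a richer (and no longer normalized) summary of the history.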
Given a corresponding set of histories H ⊂ O*, the matrix P_{T,H} := [P(ht)]_{t∈T,h∈H} plays a central role in PSR theory. T and H are called core sets if P_{T,H} has maximal rank, that is, rank(P_{T,H}) = rank(P_{O*,O*}). We use rank(M) to denote that maximum rank, also known as the linear dimension of M [2]. The linear dimension of an HMM, for example, is upper bounded by the number of latent states, regardless of the number of observations.

When T is core, P_{T|h} is provably a sufficient statistic of history, and there exist {m_o}_{o∈O} such that P(o|h) = m_o^T P_{T|h} for all h ∈ O*. Furthermore, there exist updating matrices {B_o}_{o∈O} such that P_{oT|h} = B_o P_{T|h}, where oT = {ot : t ∈ T}. Knowing {B_o} is sufficient to compute P_{T|h} for any h ∈ O* by applying iterative updates to the initial state b* := P_{T|ε} (ε is the null sequence), since

    P_{T|ho} = P_{oT|h} / P(o|h) = B_o P_{T|h} / (m_o^T P_{T|h}).    (2)

Altogether, we can compute P(x) for any x ∈ O* using a PSR B = {b*, {B_o}, {m_o}}.¹

Thanks to the linear prediction rules, PSR parameters can be computed by solving linear regression problems {b(h) ↦ P_{oT|h} : h ∈ H} (for B_o) and {b(h) ↦ P(o|h) : h ∈ H} (for m_o) [8], where each h ∈ H is a regression point and H being core guarantees that the design matrix P_{T,H} has sufficient rank. When (T, H) are core and |T| = |H| = rank(M) (in which case we say that (T, H) are minimal core), we have [9]:

    b* = P_{T|ε},   B_o = P_{oT,H} (P_{T,H})^{-1},   m_o^T = P_{o,H} (P_{T,H})^{-1},   ∀o ∈ O.    (3)

When only sample data are available, we can estimate the required statistics and plug them into Eq.(3), which yields a consistent algorithm.

3 PSR-f: Definitions and Properties

In this section we introduce the PSR-f, which extends the PSR by incorporating a "suggested" state representation f : O* → R^m that is supplied by the user. 
Crucially, we show that when f provides information relevant to the system, the PSR-f model of the system can be more succinct than the PSR model (Sec. 3.1). Succinctness often implies better finite sample performance, which is confirmed later by empirical evaluation in Sec. 6.

1 See Appendix B for why we do not adopt the more popular {b*, {B_o}, b_∞} parameterization.

State Representation   Given f : O* → R^m and a set of tests T, the state representation of a PSR-f, denoted by b(h), is the concatenation of two components:

    b(h) = [P_{T|h} ; f(h)].    (4)

While our formulation and results apply to arbitrary functions f, it is instructive to consider the special case f(h) = P_{T_f|h}, where T_f is a set of m user-specified independent tests, that is, P_{T_f,O*} has linearly independent rows.² In this case, we are essentially given a partial PSR state, and only need to find complementary tests T to complete the picture. In particular, b(h) is a full state as long as T ∪ T_f is core, meaning that only rank(M) − m tests remain to be discovered. Similarly, if f is lifted from a discrete-valued function (see Sec. 2), Appendix A shows that f can often be viewed as a transformed predictive representation, making it natural to concatenate it with P_{T|h}.

Model Parameters and Prediction Rules   A PSR-f has model parameters B = {b* ∈ R^{m+|T|}, {B_o ∈ R^{|T|×(m+|T|)}}, {m_o ∈ R^{m+|T|}}}. Below we specify the rules used to predict P(o|h) from b(h) and to update b(h) to b(ho). Using these rules, we can predict P(x) for any x ∈ O* in the same manner as standard PSRs (see Sec. 2).

Prediction: P(o|h) ≈ m_o^T b(h).   State update: b(ho) = [P_{T|ho} ; f(ho)], where P_{T|ho} ≈ B_o b(h) / (m_o^T b(h)).

(Note that we use approximate notation here because we have not yet characterized the conditions under which a PSR-f will be exact.)

Naïve Learning Algorithm   Recall that Eq.(3) can be seen as linear regression: B_o is the solution to {b(h) ↦ P_{oT|h} : h ∈ H} and m_o to {b(h) ↦ P(o|h) : h ∈ H}. We extend this idea to PSR-f. Let f_H be an m × |H| matrix whose h-th column is f(h), and P_{f,H} := f_H diag(P_{ε,H}); that is, its h-th column is P(h) f(h). PSR-f parameters can be computed by solving the following linear systems:

    b* = [P_{T|ε} ; f(ε)],   B_o [P_{T,H} ; P_{f,H}] ≈ P_{oT,H},   m_o^T [P_{T,H} ; P_{f,H}] ≈ P_{o,H},   ∀o ∈ O.    (5)

For now we assume that [P_{T,H} ; P_{f,H}] is invertible so that Eq.(5) can be solved by matrix inverse. This restriction will be removed in Sec. 4. When m = 0 we recover Eq.(3) for standard PSRs. Furthermore, by plugging in empirical estimates, we have a naïve algorithm that learns PSR-f models from data. The immediate next question is, when is this algorithm consistent?

3.1 Rank, Core, and Consistency

For PSRs, consistency requires core H and T. Since PSR-fs generalize PSRs, we will need related but slightly different conditions for H and T.

For the easy case where f(h) = P_{T_f|h}, the answer is clear: Eq.(5) is consistent if T ∪ T_f and H are, respectively, minimal core tests and histories. This establishes that a PSR-f model can be more succinct than a PSR model: if T_f consists of linearly independent tests, then with minimal T and H the size of each B_o is (rank(M) − |T_f|) × rank(M) for a PSR-f, compared to rank²(M) for a PSR.

In the rest of this section, we extend the above result to the general case, where f is an arbitrary function. We first give the definition of core tests/histories w.r.t. f, and establish consistency.

Definition 1. (T, H) are core w.r.t. 
f if

    rank [P_{T,H} ; P_{f,H}] = sup_{T',H' ⊂ O*} rank [P_{T',H'} ; P_{f,H'}].    (6)

2 We describe the scenario where f(h) = P_{T_f|h} only to help readers transfer their knowledge from PSR to PSR-f. We expect that f(h) will not take such a predictive format in practice. See the experiment setup in the aircraft domain for an example of a more natural choice of f.

As with standard PSRs, we can consider core tests and histories separately; i.e., T (or H) is core w.r.t. f if there exists H (or T) such that Definition 1 is satisfied.

Theorem 1 (Consistency). Solving Eq.(5) by matrix inverse is a consistent algorithm if (T, H) are core w.r.t. f and [P_{T,H} ; P_{f,H}] is invertible.

The proof can be found in Appendix C. While Theorem 1 guarantees consistency, we have not yet illustrated the benefits of using f. In particular, we want to characterize the sizes of the minimal core tests/histories, as they determine the number of model parameters. At a minimum, we expect that |T| < rank(M) as long as f is somewhat "useful". To formalize this idea, we introduce rank(f; M) in Definition 5, and show that the minimal sizes of core T and H w.r.t. f are directly determined by rank(f; M) in Theorem 2. To get to those results, we first introduce the notion of linear relevance.

Definition 2 (Linear Relevance). f is linearly relevant to M if, for all H ⊂ O*, rowspace(P_{f,H}) ⊆ rowspace(P_{T,H}),³ where T is any core set of tests for M.

An interesting fact is that Definition 2 is equivalent to f(h) = P_{T_f|h} for some T_f up to linear transformations (see Prop. 2 in Appendix C). While f may not be linearly relevant in general, we may expect that f has some components that are linearly relevant, although it may also contain irrelevant information. To tease them apart, we introduce the following definitions.

Definition 3. 
Define rank(f) := sup_{H ⊂ O*} rank(P_{f,H}).⁴

Definition 4 (Linearly Relevant Components). Let U_f ∈ R^{m×n} be a matrix with the maximum number of columns n such that (1) f' := U_f^T f(·) is linearly relevant to M, and (2) rank(f') = n. For any H ⊂ O*, define P*_{f,H} := P_{f',H}.

The matrix U_f extracts the linearly relevant components from f, and our algorithm in Sec. 4 will learn such a matrix from data. Now we are ready to define rank(f; M) and state Theorem 2.

Definition 5. Define rank(f; M) := sup_{H ⊂ O*} rank(P*_{f,H}).

Theorem 2. The minimal sizes of T and H that are core w.r.t. f are

    |T| = rank(M) − rank(f; M),   |H| = rank(M) + rank(f) − rank(f; M).

The proof is deferred to Appendix C. The theorem states that, as expected, the higher rank(f; M), the smaller T. On the other hand, it also implies that the more irrelevant information f contains, the larger H needs to be, which might seem counter-intuitive. Roughly speaking, this is because when H is small and rank(f) − rank(f; M) is high, f can have different behavior on h ∈ H and h ∉ H. The learning algorithm may be deceived by f's good predictions on H, only to find later that this does not generalize to new histories. In this case, we need to expand H to reveal f's full behavior in order to have a consistent algorithm. A more concrete example on this issue can be found in Appendix C.1.

4 Spectral Learning of PSR-fs

One significant limitation of Eq.(5) for learning a PSR-f is that the matrix [P_{T,H} ; P_{f,H}] needs to be invertible, and finding T and H that satisfy that criterion can be difficult. In the PSR literature, this is known as the discovery problem, and is largely solved by spectral learning [4, 3]. 
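Once T and H are fixed, the rowspace-containment condition behind linear relevance (Definition 2) reduces to a finite rank comparison: stacking the rows of P_{f,H} onto P_{T,H} must not increase the rank. A minimal numpy sketch of this check, using small made-up matrices rather than any statistics from the paper:

```python
import numpy as np

def rowspace_contained(Pf_H, PT_H, tol=1e-10):
    """True iff rowspace(Pf_H) is a subspace of rowspace(PT_H):
    appending Pf_H's rows to PT_H must leave the rank unchanged."""
    r_T = np.linalg.matrix_rank(PT_H, tol=tol)
    r_stacked = np.linalg.matrix_rank(np.vstack([PT_H, Pf_H]), tol=tol)
    return r_stacked == r_T

# Illustrative matrices (hypothetical, not from the paper).
PT_H = np.array([[0.2, 0.1, 0.3],
                 [0.1, 0.4, 0.1]])
relevant = PT_H[0:1] + 2 * PT_H[1:2]       # a linear combination of PT_H's rows
irrelevant = np.array([[1.0, 0.0, 0.0]])   # outside PT_H's row space

assert rowspace_contained(relevant, PT_H)
assert not rowspace_contained(irrelevant, PT_H)
```

The sup over H in Definitions 3 and 5 is of course not computable this way; the sketch only illustrates the per-H test that the definitions build on.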
Spectral algorithms, which are state-of-the-art for learning PSRs, take large T and H as inputs and then use singular value decomposition (SVD) to discover a transformed state representation U_T^T P_{T|h}. In this section we devise spectral algorithms for learning PSR-fs; we will need to discover not only U_T as for traditional PSRs, but also the U_f matrix that appears in Definition 4. The first step is to extend the PSR-f formulation to allow transformed representations.

Transformed PSR-f   The state in a (transformed) PSR-f is b(h) = U_T^T P_{T|h} + U_f^T f(h), where U_T ∈ R^{|T|×k}, U_f ∈ R^{m×k}, and k ≤ |T| + m is called the model rank. This representation generalizes Eq.(4), since we can recover the latter by letting k = |T| + m and

    U_T^T = [I_{|T|} ; 0_{m×|T|}],   U_f^T = [0_{|T|×m} ; I_m],

where 0 and I are zero and identity matrices, respectively.

3 rowspace(P) is the linear span of the row vectors of a matrix P.
4 Strictly speaking, rank(f) depends on M, which is implicit in notation. The slight dependence comes from the fact that P_{f,H} = f_H diag(P_{ε,H}) and P_{ε,H} depends on M. The dependence, however, is minimal, since for any H ⊂ O* we have rank(P_{f,H}) = rank(f_H diag(P_{ε,H})) = rank(f_H) as long as P(h) ≠ 0, ∀h ∈ O*.

The parameters of a rank-k transformed PSR-f are B = {b* ∈ R^k, {B_o ∈ R^{k×k}}, {m_o ∈ R^k}, U_f ∈ R^{m×k}}. (Note that U_T is only used during learning, so it does not appear as a parameter.) After initializing b(ε) = b* + U_f^T f(ε), the prediction and update rules are as follows.

    Prediction: P(o|h) ≈ m_o^T b(h).   State update: b(ho) ≈ B_o b(h) / (m_o^T b(h)) + U_f^T f(ho).    (7)

Template for spectral learning of a transformed PSR-f   If k, U_f, and U_T are given, we can easily adapt the algorithm in Eq.(5) to compute the model parameters of a transformed PSR-f. See Algorithm 1, where (·)⁺ is the matrix pseudo-inverse. In the rest of this section, we will introduce spectral algorithms that use Algorithm 1 as a subroutine and differ in their choices of U_T and U_f.

Algorithm 1  Template for learning transformed PSR-fs
Input: f : O* → R^m, U_T ∈ R^{|T|×k}, U_f ∈ R^{m×k}.
1: P̂_{f,H} := f_H diag(P̂_{ε,H}).  U := [U_T ; U_f].   ▷ P̂_(·) is the empirical estimate of P_(·)
2: b* := U_T^T P̂_{T|ε},  B_o := U_T^T P̂_{oT,H} (U^T [P̂_{T,H} ; P̂_{f,H}])⁺,  m_o^T := P̂_{o,H} (U^T [P̂_{T,H} ; P̂_{f,H}])⁺.
Output: B := {b*, {B_o}, {m_o}, U_f}.

Algorithm 2  A basic spectral algorithm for PSR-f
Input: f : O* → R^m, model rank k.
1: (U, Σ, V) := SVD([P̂_{T,H} ; P̂_{f,H}]).   ▷ singular values are in descending order
2: U_T := U_{1:|T|, 1:k}.  U_f := U_{(|T|+1):(|T|+m), 1:k}.
Output: The output of Algorithm 1 on f, U_T, and U_f.

4.1 A simple algorithm

The key operation in spectral algorithms for standard PSRs is the SVD of P_{T,H} [3, 4]. The analog of P_{T,H} in our setting is [P_{T,H} ; P_{f,H}], so our first algorithm simply takes the SVD of [P̂_{T,H} ; P̂_{f,H}] to obtain U_T and U_f; see Algorithm 2. Note that the standard spectral algorithm is recovered when m = 0. Algorithm 2 is consistent under certain conditions; the proof is deferred to Appendix E.

Theorem 3. Given any f, Algorithm 2 is consistent when T and H are core w.r.t. f and k = rank(M) + rank(f) − rank(f; M).

Despite its consistency, the algorithm has some significant caveats. In particular, Theorem 3 implies that we may need k > rank(M) to guarantee consistency, which increases the state dimensionality. To see why this is inevitable, consider the scenario where ‖f(h)‖ ≫ ‖P_{T|h}‖. 
P_{f,H} will dominate the spectrum of [P_{T,H} ; P_{f,H}], and the first rank(f) singular vectors will likely depend on P_{f,H}, regardless of whether f is relevant or not. Since the algorithm picks singular vectors based only on their singular values, we are forced to keep irrelevant components of f in our state representation, causing a blow-up in dimensionality.

4.2 Identification of linearly relevant components by principal angles

Ideally, we would like to identify the linearly relevant components of f (Definition 4) and discard the irrelevant parts. If we had access to exact statistics P_{T,H}, we could identify those linearly relevant components by computing the principal angles between the row space of P_{T,H} and that of P_{f,H} [10]. Define P̃_{T,H} to be a matrix whose rows form an orthonormal basis of P_{T,H}, and P̃_{f,H} similarly for P_{f,H}. The singular values of P̃_{T,H} P̃_{f,H}^T correspond to the cosines of their principal angles. 

Algorithm 3  Canonical angle algorithm for PSR-f
Input: f : O* → R^m, model rank k, 0 ≤ d ≤ min(k, m).
1: (U'', Σ'', V'') := SVD(P̂_{T,H}).
2: P̃_{f,H} := P̂_{f,H}, row-orthonormalized.  P̃_{T,H} := (U''_{1:|T|,1:k})^T P̂_{T,H}, row-orthonormalized.
3: (U_a, Σ_a, V_a) := SVD(P̃_{T,H} P̃_{f,H}^T).   ▷ compute principal angles
4: (U')^T := ([V_a]_{(:),1:d})^T P̃_{f,H} (P̂_{f,H})⁺.   ▷ U' ∈ R^{m×d}
5: f'(·) := γ · (U')^T f(·) for some large γ ∈ R.
6: {b*, {B_o}, {m_o}, U_{f'}} := the output of Algorithm 2 on f' and k.   ▷ U_{f'} ∈ R^{d×k}
Output: B := {b*, {B_o}, {m_o}, U' U_{f'}}.
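The principal-angle computation at the heart of Lines 2–3 can be sketched in plain numpy. This is an illustrative re-implementation (not the authors' code) that assumes both input matrices have full row rank, so the V factors of their thin SVDs give orthonormal row bases:

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Cosines of the principal angles between rowspace(A) and rowspace(B).
    Assumes A and B have full row rank."""
    # Orthonormal bases for the two row spaces (rows of the V factors).
    Va = np.linalg.svd(A, full_matrices=False)[2]
    Vb = np.linalg.svd(B, full_matrices=False)[2]
    # Singular values of Va @ Vb.T are the cosines, in descending order.
    return np.linalg.svd(Va @ Vb.T, compute_uv=False)

# Two planes in R^4 sharing exactly one direction, [1, 0, 0, 0]:
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
B = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
cos = principal_angle_cosines(A, B)
```

For these inputs the first cosine is 1 (a shared, i.e. linearly relevant, direction) and the second is 0 (orthogonal remainders), mirroring the d-dimensional-intersection discussion that follows.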
In particular, if the intersection of the row spaces of P_{T,H} and P_{f,H} is d-dimensional, then the first d singular values will be cos(0) = 1, and the remaining singular values will be less than 1.

When we only have access to empirical statistics and/or f only approximately contains linearly relevant components, the leading singular values will be close to but less than 1. Based on this observation, Algorithm 3 computes principal angles and extracts the relevant components of f in a way that is robust to statistical noise. Line 1 uses the standard spectral learning procedure to compress P̂_{T,H} and remove the dimensions that correspond to pure noise. Lines 2 and 3 compute the principal angles via SVD. Line 4 uses the right singular vectors to extract the d most relevant dimensions from f. And, finally, the last line calls Algorithm 2 with a new function γ · (U')^T f that only contains the identified relevant components.

Preservation of dimensionality and invariance to transformations   A consistency guarantee for Algorithm 3 is stated below; it shows that the dimensionality of learned state will not blow up. Furthermore, by design the algorithm is invariant to transformations, in the sense that f and A^T f will produce the same result for any full-rank matrix A, thanks to the orthonormalization step (Line 2).

Theorem 4. Given any f, Algorithm 3 is consistent as long as (H, T) is core w.r.t. f, k = rank(M), d = rank(f; M), and γ is a fixed positive constant.

Reduced model complexity (γ → ∞)   In Sec. 3.1 we saw that when f is linearly relevant, B_o for a non-transformed PSR-f only needs rank(M)(rank(M) − rank(f; M)) parameters. In a transformed PSR-f, however, the size of B_o is always k × k. To guarantee consistency we need k ≥ rank(M), so at a superficial level no savings in model parameters seems possible.

However, when we re-express a non-transformed PSR-f in the transformed form (Eq.(7)), the U_T matrix has many zero entries, which leads to zeros in B_o ∈ R^{k×k} (see Algorithm 1), implying that the effective size of B_o can be much smaller than k². Based on this observation, we show that when γ → ∞, Algorithm 3 produces B_o that has at most k(k − d) non-zero entries.

Proposition 1. In the limit as γ → ∞, B_o has at most k(k − d) non-zero entries for all o. When k and d are as in Theorem 4, the number of non-zero entries is at most rank(M)(rank(M) − rank(f; M)).

See the proof and additional details in Appendix E.1. When k and d are as in Theorem 4, the effective size of B_o matches the analysis in Sec. 3.1. Notably, unlike in Sec. 3.1, Prop. 1 does not rely on f being linearly relevant, and the model complexity is as if only the linearly relevant components of f were given to begin with.⁵ In practice, the algorithm behaves robustly for any reasonably large γ.

5 Note that a mis-specified f can still make learning U_f more challenging when m is large; however, the impact is much less significant than in Algorithm 2 or the alternative approaches discussed in Appendix E.1.

5 Related Work

Our work has a similar motivation to that of James et al. [5] (and related work [6, 11]), who incorporate a user-provided partition on PSR histories by learning one model for each partition; these are called 
While they\nshow examples where mPSRs can have fewer parameters than PSRs, we are not aware of a general\ncharacterization of when this happens; the strongest existing result is that the model size will not\nblow up (see Theorem 1 in James et al. [5]). In contrast, we are able to characterize the relevance of\nan arbitrary function f and quantify the model size explicitly (Sec. 3.1 and Prop. 1).\nFeature PSRs [4] also attempt to leverage domain knowledge by using user-provided history and test\nfeatures to improve learning ef\ufb01ciency. While our use of f is super\ufb01cially similar, in fact the two\napproaches are quite distinct (and complementary). In our formulation, f forms a part of the state\nthat is computed directly and not maintained via iterative updates. In contrast, for feature PSRs, all\ndimensions of the state still need to be discovered from data and updated iteratively during prediction.\nFurther discussion of these differences can be found in Appendix D.\n\n6 Experiments\n\n6.1 Synthetic HMMs\nDomain We generate HMMs with 10 states and 20 observations as the ground truth. Each state has\n3 possible observations and 3 possible next states. We consider two types of topologies: with RAND\ntopology, the possible next states for each state are chosen uniformly at random; with RING topology,\nthe states form a ring and each state can only transition to its neighbors or itself. All non-zero\nparameters of the HMMs are generated by sampling from U [0, 1] followed by normalization.\nThe function f For each HMM we provide two functions: the \ufb01rst function (\u201cdummy-0\u201d) takes the\nform of f (h) = PTf|h, with Tf containing 3 independent tests. The second function (\u201cdummy-3\u201d)\nappends 3 more features to the \ufb01rst one. The new features are predictions of 3 independent tests but\nfor a different HMM7 hence are irrelevant to the HMM of interest. 
While we might want to make the problem more challenging by transforming the function so that the relevant and the irrelevant features mix together, this has no effect on Algorithm 3 since it is invariant to transformation (Sec. 4).

Algorithm details   For both standard spectral learning ("PSR") and Algorithm 3 ("PSR-f-dummy-X"), T and H consist of all the observation sequences of lengths 1 and 2. The hyperparameter d for Algorithm 3 is tuned by 3-fold cross validation on training data, and γ is set to 100 to ensure a succinct model (see Appendix E.1). We additionally include a baseline that uses f (without useless features) as state and only learns vectors {m_o} such that P(o|h) ≈ m_o^T f(h) ("f-only").

Results   From each HMM we generate 500, 1000, and 2000 sequences of length 5 as training data. The models are evaluated by the log-loss (i.e., negative log-likelihood) on 1000 test sequences of length 5. Fig. 1a shows the log-loss of different algorithms as a function of model rank k for sample size 1000, and the rest of the results can be found in Appendix F. We can see that PSR-f models outperform PSRs across all model ranks and all sample sizes. While using f alone gives somewhat comparable performance in the low sample regime, it fails to improve with more data due to its incomplete state representation, whereas PSR-f can leverage an imperfect f while remaining consistent. Finally, while adding irrelevant features hurts performance in the small sample regime, the degradation is only mild and goes away as more data become available.

6.2 Aircraft Identification

The next domain is a POMDP developed by Cassandra [12], which we convert into an HMM by using a uniformly random policy. The POMDP simulates a military base using noisy sensors to decide whether an approaching aircraft is friend/foe, and how far away it is. See Appendix F and [12, Chapter H.4] for a detailed specification. 
Each observation consists of a binary foe/friend signal and a distance, both of which are noisy. We average both components over all previous time steps to obtain ê and l̂, respectively—intuitively these should be relevant to the state—and compute f(h) = [ê  ê²  l̂  l̂²]^T. We generate 100, 200, . . . , 500 trajectories as training data, and evaluate the models on 1000 trajectories of length 3. T and H consist of all sequences of lengths up to 2 and 3, respectively.

6 It should be noted that learning a separate model for each partition enables nonlinear dependence on f, which cannot be directly expressed in our framework. However, Sec. 3 gives the regression view of PSR-f, which can be extended to nonlinear regression as done by Hefny et al. [8].
7 The HMMs used to generate the irrelevant features have RAND topology, 20 states, 20 observations, 3 possible next states, and 20 possible observations per state.

[Figure 1 appears here; plots omitted.] Figure 1: (a) Synthetic HMMs (Sec. 6.1). The y-axis is relative log-loss (the lower the better), where zero corresponds to the log-loss of the ground truth model. "f-only" does not depend on model rank and is a horizontal line. All results are averaged over 100 trials, and all error bars in this paper show 95% confidence intervals. Sample size is 1000. See text and Appendix F for more details and full results. (b) Aircraft Identification domain (Sec. 6.2). Results are averaged over 100 trials.

[Figure 2 appears here; plots omitted.] Figure 2: (a) Results on the gene splice dataset. Left: relative log-likelihood (the higher the better) of learned models on in-class test sequences, where zero corresponds to a uniform model (all observations are i.i.d. and equally likely). Right: relative classification accuracy on test data (the higher the better), where zero corresponds to a classifier that always predicts the neutral label. (b) Unexpected results (Sec. 6.4). The figure shows the performance of standard spectral learning on RING HMMs, where certain empirical estimates are replaced by exact statistics. Sample size is 5000. See text for details.

Fig. 1b reports the log-loss of standard spectral learning, of using f alone, and of our Algorithm 3 as a function of sample size. Model rank is optimized separately between 1 and 20 for each model at each sample size. The figure shows that PSR-f outperforms both PSRs and f alone in the small sample region. As sample size grows, PSRs are able to improve by discovering good representations from the data, whereas using f alone suffers from a fixed and limited representation. 
In this case, PSR-f smoothly converges to match the PSR, enjoying the best of both worlds.

6.3 Gene splice dataset

Finally, we experiment on a gene splice dataset [13]. Each data point is a DNA sequence of length 60 and a class label that is either positive, negative, or neutral. The class prior is roughly 1:1:2. Following prior work [14], we train models of rank 4 for each class separately. Given models for each class, we use Bayes' rule to compute the posterior over labels given the test sequence and predict the label with the highest posterior. We compare the different algorithms using the log-likelihood of test sequences from the same class, as well as using classification accuracy.

For PSR and PSR-f, H and T are set to all sequences up to length 4. We estimate the empirical statistics by a moving-window approach to make full use of long sequences [15], which effectively turns every long sequence into 55 short sequences. We use 200 long sequences as training data, and 1000 sequences as test data. The PSR-f learned from Algorithm 3 uses a simple second-order Markov representation as f (a one-hot vector with m = 16). The hyperparameter d is tuned by 5-fold cross validation. As an additional baseline, we also learn a rank-4 model using f as the state representation, by first randomly projecting it down to 4 dimensions and then learning o as in the synthetic experiments. We run this baseline 5 times and report the best performance (legend: "Mkv").

Fig. 2a shows the prediction accuracy on test sequences as well as the final classification accuracy. Again we observe that PSR-f is able to outperform the standard PSR and the Markov baseline under both metrics, even when the domain knowledge provided by f is fairly weak.

6.4 Unexpected results

We conduct further experiments to empirically explore what kind of f is most beneficial.
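As a reference point for the variants discussed next, the standard spectral learning recipe can be sketched as follows. This is a minimal illustration in the spirit of Hsu et al. [3], not the paper's Algorithm 3; the function names and interface are ours, and we feed in exact statistics of a toy i.i.d. process in place of the empirical estimates P̂_{T,H} and P̂_{oT,H}.

```python
import numpy as np

def learn_spectral(P_TH, P_oTH, P_T, P_H, k):
    """Spectral learning from Hankel-style statistics (sketch).

    P_TH  : |T| x |H| matrix; entry (t, h) = Pr[history h followed by test t]
    P_oTH : dict mapping each observation o to the |T| x |H| matrix
            with entry (t, h) = Pr[history h, then o, then test t]
    P_T   : length-|T| vector of test probabilities
    P_H   : length-|H| vector of history probabilities
    k     : model rank
    """
    U, _, _ = np.linalg.svd(P_TH)
    U = U[:, :k]                               # top-k left singular vectors
    pinv = np.linalg.pinv(U.T @ P_TH)          # (U^T P_{T,H})^+
    B = {o: U.T @ M @ pinv for o, M in P_oTH.items()}  # observable operators
    b1 = U.T @ P_T                             # initial weights
    binf = np.linalg.pinv(P_TH.T @ U) @ P_H    # normalization weights
    return b1, binf, B

def predict(b1, binf, B, seq):
    # Pr[o_1 ... o_n] = binf^T B_{o_n} ... B_{o_1} b1
    b = b1
    for o in seq:
        b = B[o] @ b
    return float(binf @ b)

# Toy check: exact statistics of an i.i.d. binary source with Pr[0] = 0.7,
# using T = H = {0, 1}; a rank-1 model recovers the true probabilities.
p = np.array([0.7, 0.3])
P_TH = np.outer(p, p)
P_oTH = {o: p[o] * P_TH for o in (0, 1)}
b1, binf, B = learn_spectral(P_TH, P_oTH, p, p, k=1)
```

The variants in Fig. 2b correspond to swapping one of the two empirical inputs (P̂_{T,H} or P̂_{oT,H}) for its exact counterpart while leaving the rest of the pipeline unchanged.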
The full experiments and findings are deferred to Appendix G due to space limitations. Surprisingly, we find some highly counter-intuitive behavior that cannot be well explained by existing theory. Roughly speaking, giving away exact statistics to the algorithm can sometimes hurt performance drastically. To show that this is not specific to our setting, we are able to produce similar behavior in the standard PSR learning setting without f. Fig. 2b shows the performance of the standard spectral algorithm and two variants of it: (1) P̂_{T,H} is replaced by the exact P_{T,H}, and (2) P̂_{oT,H} is replaced by the exact P_{oT,H}. While we would expect both variants to improve over the baseline, using the exact P_{T,H} results in severely degenerate performance when the model is full rank. More detailed discussion can be found in Appendix G, and further investigation may lead to new theoretical and practical insights into spectral learning.

7 Conclusions

We proposed the PSR-f, a model that generalizes PSRs by taking advantage of a representation f that encodes domain knowledge. Our Algorithm 3 spectrally learns PSR-f models and discovers the relevant components of f using principal angles. The algorithm preserves the dimension of state, is invariant to transformations of f, and can achieve reduced model complexity when f contains useful information. Future research directions include extending PSR-f to allow more powerful regression tools, and unifying PSR-f with prior work based on discrete-valued side information [5, 6, 11].

Acknowledgments

This work was supported by NSF grant IIS 1319365. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.

References

[1] Michael L. Littman, Richard S. Sutton, and Satinder P. Singh. Predictive representations of state. In NIPS, volume 14, pages 1555–1561, 2001.

[2] Satinder Singh, Michael R. James, and Matthew R. Rudary.
Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 512–519. AUAI Press, 2004.

[3] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.

[4] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, pages 1369–1370, 2010.

[5] Michael R. James, Britton Wolfe, and Satinder P. Singh. Combining memory and landmarks with predictive state representations. In IJCAI, pages 734–739, 2005.

[6] Yunlong Liu, Yun Tang, and Yifeng Zeng. Predictive state representations with state space partitioning. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1259–1266, 2015.

[7] Junhyuk Oh, Valliappa Chockalingam, Honglak Lee, et al. Control of memory, active perception, and action in Minecraft. In International Conference on Machine Learning, pages 2790–2799, 2016.

[8] Ahmed Hefny, Carlton Downey, and Geoffrey J. Gordon. Supervised learning for dynamical system learning. In Advances in Neural Information Processing Systems, pages 1963–1971, 2015.

[9] Michael R. James and Satinder Singh. Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the 21st International Conference on Machine Learning, page 53. ACM, 2004.

[10] Åke Björck and Gene H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, 27(123):579–594, 1973.

[11] Sylvie C. W. Ong, Yuri Grinberg, and Joelle Pineau. Mixed observability predictive state representations.
In AAAI, 2013.

[12] Anthony Rocco Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University, 1998.

[13] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[14] Amirreza Shaban, Mehrdad Farajtabar, Bo Xie, Le Song, and Byron Boots. Learning latent variable models by improving spectral solutions with exterior point method. In UAI, pages 792–801, 2015.

[15] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-rank spectral learning with weighted loss functions. In Artificial Intelligence and Statistics, pages 517–525, 2015.

[16] Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the 21st International Conference on Machine Learning, page 88. ACM, 2004.

[17] Nan Jiang, Alex Kulesza, and Satinder Singh. Improving predictive state representations via gradient descent. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[18] Sajid M. Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In International Conference on Artificial Intelligence and Statistics, pages 741–748, 2010.

[19] François Denis, Mattias Gybels, and Amaury Habrard. Dimension-free concentration bounds on Hankel matrices for spectral learning. In Proceedings of the 31st International Conference on Machine Learning, pages 449–457, 2014.

[20] Alex Kulesza, Nan Jiang, and Satinder Singh. Spectral learning of predictive state representations with insufficient statistics.
In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.