{"title": "Learning Linear Dynamical Systems via Spectral Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 6702, "page_last": 6712, "abstract": "We present an efficient and practical algorithm for the online prediction of discrete-time linear dynamical systems with a symmetric transition matrix. We circumvent the non-convex optimization problem using improper learning: carefully overparameterize the class of LDSs by a polylogarithmic factor, in exchange for convexity of the loss functions. From this arises a polynomial-time algorithm with a near-optimal regret guarantee, with an analogous sample complexity bound for agnostic learning. Our algorithm is based on a novel filtering technique, which may be of independent interest: we convolve the time series with the eigenvectors of a certain Hankel matrix.", "full_text": "Learning Linear Dynamical Systems\n\nvia Spectral Filtering\n\nElad Hazan, Karan Singh, Cyril Zhang\n\nDepartment of Computer Science\n\nPrinceton University\nPrinceton, NJ 08544\n\n{ehazan,karans,cyril.zhang}@cs.princeton.edu\n\nAbstract\n\nWe present an ef\ufb01cient and practical algorithm for the online prediction of\ndiscrete-time linear dynamical systems with a symmetric transition matrix. We\ncircumvent the non-convex optimization problem using improper learning: care-\nfully overparameterize the class of LDSs by a polylogarithmic factor, in exchange\nfor convexity of the loss functions. From this arises a polynomial-time algorithm\nwith a near-optimal regret guarantee, with an analogous sample complexity bound\nfor agnostic learning. 
Our algorithm is based on a novel \ufb01ltering technique, which\nmay be of independent interest: we convolve the time series with the eigenvectors\nof a certain Hankel matrix.\n\n1\n\nIntroduction\n\nLinear dynamical systems (LDSs) are a class of state space models which accurately model many\nphenomena in nature and engineering, and are applied ubiquitously in time-series analysis, robotics,\neconometrics, medicine, and meteorology. In this model, the time evolution of a system is explained\nby a linear map on a \ufb01nite-dimensional hidden state, subject to disturbances from input and noise.\nRecent interest has focused on the effectiveness of recurrent neural networks (RNNs), a nonlinear\nvariant of this idea, for modeling sequences such as audio signals and natural language.\nCentral to this \ufb01eld of study is the problem of system identi\ufb01cation: given some sample trajectories,\noutput the parameters for an LDS which generalize to predict unseen future data. Viewed directly,\nthis is a non-convex optimization problem, for which ef\ufb01cient algorithms with theoretical guarantees\nare very dif\ufb01cult to obtain. A standard heuristic for this problem is expectation-maximization (EM),\nwhich can \ufb01nd poor local optima in theory and practice.\nWe consider a different approach: we formulate system identi\ufb01cation as an online learning problem,\nin which neither the data nor predictions are assumed to arise from an LDS. Furthermore, we slightly\noverparameterize the class of predictors, yielding an online convex program amenable to ef\ufb01cient\nregret minimization. This carefully chosen relaxation, which is our main theoretical contribution,\nexpands the dimension of the hypothesis class by only a polylogarithmic factor. This construction\nrelies upon recent work on the spectral theory of Hankel matrices.\nThe result is a simple and practical algorithm for time-series prediction, which deviates signi\ufb01cantly\nfrom existing methods. 
We coin the term wave-filtering for our method, in reference to our relaxation's use of convolution by wave-shaped eigenvectors. We present experimental evidence on both toy data and a physical simulation, showing our method to be competitive in terms of predictive performance, more stable, and significantly faster than existing algorithms.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1.1 Our contributions\nConsider a discrete-time linear dynamical system with inputs {xt}, outputs {yt}, and a latent state {ht}, which can all be multi-dimensional. With noise vectors {ηt}, {ξt}, the system's time evolution is governed by the following equations:\n\nht+1 = Aht + Bxt + ηt\nyt = Cht + Dxt + ξt.\n\nIf the dynamics A, B, C, D are known, then the Kalman filter [Kal60] is known to estimate the hidden state optimally under Gaussian noise, thereby producing optimal predictions of the system's response to any given input. However, this is rarely the case: real-world systems are seldom purely linear, and rarely are their evolution matrices known.\nWe henceforth give a provable, efficient algorithm for the prediction of sequences arising from an unknown dynamical system as above, in which the matrix A is symmetric. Our main theoretical contribution is a regret bound for this algorithm, giving nearly-optimal convergence to the lowest mean squared prediction error (MSE) realizable by a symmetric LDS model:\nTheorem 1 (Main regret bound; informal). On an arbitrary sequence {(xt, yt)}_{t=1}^T, Algorithm 1 makes predictions {ŷt}_{t=1}^T which satisfy\n\nMSE(ŷ1, . . . , ŷT) − MSE(y*1, . . . , y*T) ≤ Õ( poly(n, m, d, log T) / √T ),\n\ncompared to the best predictions {y*t}_{t=1}^T by a symmetric LDS, while running in polynomial time.\nNote that the signal need not be generated by an LDS, and can even be adversarially chosen. In the less general batch (statistical) setting, we use the same techniques to obtain an analogous sample complexity bound for agnostic learning:\nTheorem 2 (Batch version; informal). For any choice of ε > 0, given access to an arbitrary distribution D over training sequences {(xt, yt)}_{t=1}^T, Algorithm 2, run on N i.i.d. sample trajectories from D, outputs a predictor Θ̂ such that\n\nE_D[ MSE(Θ̂) − MSE(Θ*) ] ≤ ε + Õ( poly(n, m, d, log T, log 1/ε) ) / √N,\n\ncompared to the best symmetric LDS predictor Θ*, while running in polynomial time.\nTypical regression-based methods require the LDS to be strictly stable, and degrade on ill-conditioned systems; they depend on a spectral radius parameter 1/(1 − ‖A‖). Our proposed method of wave-filtering provably and empirically works even for the hardest case of ‖A‖ = 1. Our algorithm attains the first condition number-independent polynomial guarantees in terms of regret (equivalently, sample complexity) and running time for the MIMO setting. Interestingly, our algorithms never need to learn the hidden state, and our guarantees can be sharpened to handle the case when the dimensionality of ht is infinite.\n\n1.2 Related work\n\nThe modern setting for LDS arose in the seminal work of Kalman [Kal60], who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system. 
The framework and filtering algorithm have proven to be a mainstay in control theory and time-series analysis; indeed, the term Kalman filter model is often used interchangeably with LDS. We refer the reader to the classic survey [Lju98], and the extensive overview of recent literature in [HMR16].\nRoweis and Ghahramani [RG99] suggest using the EM algorithm to learn the parameters of an LDS. This approach, which directly tackles the non-convex problem, is widely used in practice [Mar10a]. However, it remains a long-standing challenge to characterize the theoretical guarantees afforded by EM. We find that it is easy to produce cases where EM fails to identify the correct system.\nIn a recent result of [HMR16], it is shown for the first time that for a restricted class of systems, gradient descent (also widely used in practice, perhaps better known in this setting as backpropagation) guarantees polynomial convergence rates and sample complexity in the batch setting. Their result applies essentially only to the SISO case (vs. multi-dimensional for us), depends polynomially on the spectral gap (as opposed to no dependence for us), and requires the signal to be created by an LDS (vs. arbitrary for us).\n\n2 Preliminaries\n\n2.1 Linear dynamical systems\n\nMany different settings have been considered, in which the definition of an LDS takes on many variants. We are interested in discrete time-invariant MIMO (multiple input, multiple output) systems with a finite-dimensional hidden state.1 Formally, our model is given as follows:\nDefinition 2.1. A linear dynamical system (LDS) is a map from a sequence of input vectors x1, . . . , xT ∈ R^n to output (response) vectors y1, . . . , yT ∈ R^m of the form\n\nht+1 = Aht + Bxt + ηt,   (1)\nyt = Cht + Dxt + ξt,   (2)\n\nwhere h0, . . . , hT ∈ R^d is a sequence of hidden states, A, B, C, D are matrices of appropriate dimension, and ηt ∈ R^d, ξt ∈ R^m are (possibly stochastic) noise vectors.\nUnrolling this recursive definition gives the impulse response function, which uniquely determines the LDS. For notational convenience, for invalid indices t ≤ 0, we define xt, ηt, and ξt to be the zero vector of appropriate dimension. Then, we have:\n\nyt = Σ_{i=1}^{T−1} CA^i (Bxt−i + ηt−i) + CA^t h0 + Dxt + ξt.   (3)\n\nWe will consider the (discrete) time derivative of the impulse response function, given by expanding yt − yt−1 by Equation (3). For the rest of this paper, we focus our attention on systems subject to the following restrictions:\n\n(i) The LDS is Lyapunov stable: ‖A‖2 ≤ 1, where ‖·‖2 denotes the operator (a.k.a. spectral) norm.\n(ii) The transition matrix A is symmetric and positive semidefinite.2\n\nThe first assumption is standard: when the hidden state is allowed to blow up exponentially, fine-grained prediction is futile. In fact, many algorithms only work when ‖A‖ is bounded away from 1, so that the effect of any particular xt on the hidden state (and thus the output) dissipates exponentially. We do not require this stronger assumption.\nWe take a moment to justify assumption (ii), and why this class of systems is still expressive and useful. First, symmetric LDSs constitute a natural class of linearly-observable, linearly-controllable systems with dissipating hidden states (for example, physical systems with friction or heat diffusion). Second, this constraint has been used successfully for video classification and tactile recognition tasks [HSC+16]. 
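As a sanity check on the unrolled form, the snippet below (our own illustration, not code from the paper) compares the recursion (1)–(2) against an explicit impulse-response sum in the noiseless case with h0 = 0. Note that under the reading where yt observes the state before the update, the matrix power in the sum is A^(i−1); the displayed indexing of (3) shifts by one under the other reading of the recursion.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m, T = 3, 2, 2, 20
A = np.diag(rng.uniform(0.0, 1.0, d))  # symmetric psd with spectral norm <= 1
B = rng.standard_normal((d, n))
C = rng.standard_normal((m, d))
D = rng.standard_normal((m, n))
xs = rng.standard_normal((T, n))

# (1)-(2): recursive rollout, h_0 = 0, no noise
h, ys_rec = np.zeros(d), []
for x in xs:
    ys_rec.append(C @ h + D @ x)   # observe y_t from the current state
    h = A @ h + B @ x              # advance the state

# unrolled impulse response: y_t = sum_{i=1}^{t} C A^{i-1} B x_{t-i} + D x_t
ys_unrolled = []
for t in range(T):
    y = D @ xs[t]
    for i in range(1, t + 1):
        y = y + C @ np.linalg.matrix_power(A, i - 1) @ B @ xs[t - i]
    ys_unrolled.append(y)

assert np.allclose(ys_rec, ys_unrolled)
```

The hidden state never has to be estimated to evaluate either form, which foreshadows why the algorithm below can avoid state estimation entirely.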
Interestingly, though our theorems require symmetric A, our algorithms appear to tolerate some non-symmetric (and even nonlinear) transitions in practice.\n\n2.2 Sequence prediction as online regret minimization\n\nA natural formulation of system identification is that of online sequence prediction. At each time step t, an online learner is given an input xt, and must return a predicted output ŷt. Then, the true response yt is observed, and the predictor suffers a squared-norm loss of ‖yt − ŷt‖^2. Over T rounds, the goal is to predict as accurately as the best LDS in hindsight.\n\n1We assume finite dimension for simplicity of presentation. However, it will be evident that hidden-state dimension has no role in our algorithm, and shows up as ‖B‖F and ‖C‖F in the regret bound.\n2The psd constraint on A can be removed by augmenting the inputs xt with extra coordinates (−1)^t xt. We omit this for simplicity of presentation.\n\nNote that the learner is permitted to access the history of observed responses {y1, . . . , yt−1}. Even in the presence of statistical (non-adversarial) noise, the fixed maximum-likelihood sequence produced by Θ = (A, B, C, D, h0) will accumulate error linearly in T. Thus, we measure performance against a more powerful comparator, which fixes LDS parameters Θ, and predicts yt by the previous response yt−1 plus the derivative of the impulse response function of Θ at time t.\nWe will exhibit an online algorithm that can compete against the best Θ in this setting. Let ŷ1, . . . , ŷT be the predictions made by an online learner, and let y*1, . . . , y*T be the sequence of predictions, realized by a chosen setting of LDS parameters Θ, which minimize total squared error. Then, we define regret by the difference of total squared-error losses:\n\nRegret(T) def= Σ_{t=1}^T ‖yt − ŷt‖^2 − Σ_{t=1}^T ‖yt − y*t‖^2.\n\nThis setup fits into the standard setting of online convex optimization (in which a sublinear regret bound implies convergence towards optimal predictions), save for the fact that the loss functions are non-convex in the system parameters. Also, note that a randomized construction (set all xt = 0, and let yt be i.i.d. Bernoulli random variables) yields a lower bound3 for any online algorithm: E[Regret(T)] ≥ Ω(√T).\nTo quantify regret bounds, we must state our scaling assumptions on the (otherwise adversarial) input and output sequences. We assume that the inputs are bounded: ‖xt‖2 ≤ Rx. Also, we assume that the output signal is Lipschitz in time: ‖yt − yt−1‖2 ≤ Ly. The latter assumption exists to preclude pathological inputs where an online learner is forced to incur arbitrarily large regret. For a true noiseless LDS, Ly is not too large; see Lemma F.5 in the appendix.\nWe note that an optimal Õ(√T) regret bound can be trivially achieved in this setting by algorithms such as Hedge [LW94], using an exponential-sized discretization of all possible LDS parameters; this is the online equivalent of brute-force grid search. Strikingly, our algorithms achieve essentially the same regret bound, but run in polynomial time.\n\n2.3 The power of convex relaxations\n\nMuch work in system identification, including the EM method, is concerned with explicitly finding the LDS parameters Θ = (A, B, C, D, h0) which best explain the data. 
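In code, the regret of a prediction sequence against a fixed comparator is just a difference of cumulative squared losses; the small helper below is our own, not part of the paper's algorithm:

```python
import numpy as np

def regret(ys, ys_hat, ys_star):
    """Regret(T) = sum_t ||y_t - yhat_t||^2  -  sum_t ||y_t - ystar_t||^2."""
    ys, ys_hat, ys_star = (np.asarray(a, dtype=float) for a in (ys, ys_hat, ys_star))
    return float(np.sum((ys - ys_hat) ** 2) - np.sum((ys - ys_star) ** 2))
```

Sublinear growth of this quantity in T means the learner's time-averaged loss approaches that of the comparator.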
However, it is evident from Equation (3) that the CA^iB terms cause the least-squares (or any other) loss to be non-convex in Θ. Many methods used in practice, including EM and subspace identification, heuristically estimate each hidden state ht, after which estimating the parameters becomes a convex linear regression problem. However, this first step is far from guaranteed to work in theory or practice.\nInstead, we follow the paradigm of improper learning: in order to predict sequences as accurately as the best possible LDS Θ* ∈ H, one need not predict strictly from an LDS. The central driver of our algorithms is the construction of a slightly larger hypothesis class Ĥ, for which the best predictor Θ̂* is nearly as good as Θ*. Furthermore, we construct Ĥ so that the loss functions are convex under this new parameterization. From this will follow our efficient online algorithm.\nAs a warmup example, consider the following overparameterization: pick some time window τ ≪ T, and let the predictions ŷt be linear in the concatenation [xt, . . . , xt−τ] ∈ R^{τd}. When ‖A‖ is bounded away from 1, this is a sound assumption.4 However, in general, this approximation is doomed to either truncate longer-term input-output dependences (short τ), or suffer from overfitting (long τ). Our main theorem uses an overparameterization whose approximation factor ε is independent of ‖A‖, and whose sample complexity scales only as Õ(polylog(T, 1/ε)).\n\n3This is a standard construction; see, e.g., Theorem 3.2 in [Haz16].\n4This assumption is used in autoregressive models; see Section 6 of [HMR16] for a theoretical treatment.\n\n2.4 Low approximate rank of Hankel matrices\n\nOur analysis relies crucially on the spectrum of a certain Hankel matrix, a square matrix whose anti-diagonal stripes have equal entries (i.e. Hij is a function of i + j). An important example is the Hilbert matrix Hn,θ, the n-by-n matrix whose (i, j)-th entry is 1/(i + j + θ). For example,\n\nH3,−1 = [ 1 1/2 1/3 ; 1/2 1/3 1/4 ; 1/3 1/4 1/5 ].\n\nThis and related matrices have been studied under various lenses for more than a century: see, e.g., [Hil94, Cho83]. A basic fact is that Hn,θ is a positive definite matrix for every n ≥ 1, θ > −2.\nThe property we are most interested in is that the spectrum of a positive semidefinite Hankel matrix decays exponentially, a difficult result derived in [BT16] via Zolotarev rational approximations. We state these technical bounds in Appendix E.\n\n3 The wave-filtering algorithm\n\nOur online algorithm (Algorithm 1) runs online projected gradient descent [Zin03] on the squared loss ft(Mt) def= ‖yt − ŷt(Mt)‖^2. Here, each Mt is a matrix specifying a linear map from featurized inputs X̃t to predictions ŷt. Specifically, after choosing a certain bank of k filters {φj}, X̃t ∈ R^{nk+2n+m} consists of convolutions of the input time series with each φj (scaled by certain constants), along with xt−1, xt, and yt−1. The number of filters k will turn out to be polylogarithmic in T.\nThe filters {φj} and scaling factors {σj^{1/4}} are given by the top eigenvectors and eigenvalues of the Hankel matrix ZT ∈ R^{T×T}, whose entries are given by\n\nZij := 2 / ((i + j)^3 − (i + j)).\n\nIn the language of Section 2.3, one should think of each Mt as arising from an Õ(poly(m, n, d, log T))-dimensional hypothesis class Ĥ, which replaces the original O((m + n + d)^2)-dimensional class H of LDS parameters (A, B, C, D, h0). 
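The matrix ZT and its top eigenpairs take only a few lines of numpy (a sketch of our own; function names are not from the paper), and the spectral decay quoted from [BT16] is easy to observe numerically:

```python
import numpy as np

def hankel_Z(T):
    """Z_ij = 2 / ((i+j)^3 - (i+j)), with 1-based indices i, j."""
    idx = np.arange(1, T + 1)
    s = idx[:, None] + idx[None, :]
    return 2.0 / (s**3 - s)

def wave_filters(T, k):
    """Top-k eigenpairs (sigma_j, phi_j) of Z_T; eigh returns ascending order."""
    sigma, phi = np.linalg.eigh(hankel_Z(T))
    return sigma[::-1][:k], phi[:, ::-1][:, :k]

sigma, phi = wave_filters(100, 10)
```

Even at T = 100, the tenth eigenvalue is already orders of magnitude below the first; this rapid decay is what lets the number of filters k stay polylogarithmic in T.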
Theorem 3 gives the key fact that Ĥ approximately contains H.\n\nAlgorithm 1 Online wave-filtering algorithm for LDS sequence prediction\n1: Input: time horizon T, filter parameter k, learning rate η, radius parameter RM.\n2: Compute {(σj, φj)}_{j=1}^k, the top k eigenpairs of ZT.\n3: Initialize M1 ∈ R^{m×k′}, where k′ def= nk + 2n + m.\n4: for t = 1, . . . , T do\n5:   Compute X̃ ∈ R^{k′}, with first nk entries X̃(i,j) := σj^{1/4} Σ_{u=1}^{T−1} φj(u) xt−u(i), followed by the 2n + m entries of xt−1, xt, and yt−1.\n6:   Predict ŷt := Mt X̃.\n7:   Observe yt. Suffer loss ‖yt − ŷt‖^2.\n8:   Gradient update: Mt+1 ← Mt − 2η(ŷt − yt) ⊗ X̃.\n9:   if ‖Mt+1‖F ≥ RM then\n10:    Perform Frobenius norm projection: Mt+1 ← (RM/‖Mt+1‖F) · Mt+1.\n11:  end if\n12: end for\n\nIn Section 4, we provide the precise statement and proof of Theorem 1, the main regret bound for Algorithm 1, with some technical details deferred to the appendix. We also obtain analogous sample complexity results for batch learning; however, on account of some definitional subtleties, we defer all discussion of the offline case, including the statement and proof of Theorem 2, to Appendix A.\nWe make one final interesting note here, from which the name wave-filtering arises: when plotted coordinate-wise, our filters {φj} look like the vibrational modes of an inhomogeneous spring (see Figure 1). We provide some insight on this phenomenon (along with some other implementation concerns) in Appendix B. Succinctly: in the scaling limit, (ZT/‖ZT‖2) as T → ∞ commutes with a certain second-order Sturm-Liouville differential operator D. 
This allows us to approximate filters with eigenfunctions of D, using efficient numerical ODE solvers.\n\nFigure 1: (a) The entries of some typical eigenvectors of Z1000, plotted coordinate-wise. (b) φ27 of Z1000 (σ27 ≈ 10^−16) computed with finite-precision arithmetic, along with a numerical solution to the ODE in Appendix B.1 with λ = 97. (c) Some very high-order filters, computed using the ODE, would be difficult to obtain by eigenvector computations.\n\n4 Analysis\n\nWe first state the full form of the regret bound achieved by Algorithm 1:5\nTheorem 1 (Main). On any sequence {(xt, yt)}_{t=1}^T, Algorithm 1, with a choice of k = Θ(log^2 T log(RΘRxLyn)), RM = Θ(RΘ^2 √k), and η = Θ((Rx^2 Ly log(RΘRxLyn) n√T log^4 T)^−1), achieves regret\n\nRegret(T) ≤ O( RΘ^4 Rx^2 Ly log^2(RΘRxLyn) · n√T log^6 T ),\n\ncompeting with LDS predictors (A, B, C, D, h0) with 0 ≼ A ≼ I and ‖B‖F, ‖C‖F, ‖D‖F, ‖h0‖ ≤ RΘ.\n\nNote that the dimensions m, d do not appear explicitly in this bound, though they typically factor into RΘ. In Section 4.1, we state and prove Theorem 3, the convex relaxation guarantee for the filters, which may be of independent interest. This allows us to approximate the optimal LDS in hindsight (the regret comparator) by the loss-minimizing matrix Mt : X̃ ↦ ŷt. 
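Algorithm 1 is compact enough to sketch end to end. The following is our own minimal implementation (dense eigendecomposition, no attempt at the tuned constants of Theorem 1, generous default radius), with the update written as a descent step on ‖yt − MX̃‖²:

```python
import numpy as np

def wave_filter_online(xs, ys, k=10, lr=0.02, R_M=100.0):
    """Sketch of online wave-filtering: projected gradient descent on M -> yhat = M @ Xtilde."""
    T, n = xs.shape
    m = ys.shape[1]
    idx = np.arange(1, T + 1)
    s = idx[:, None] + idx[None, :]
    sigma, phi = np.linalg.eigh(2.0 / (s**3 - s))        # eigenpairs of Z_T, ascending
    sigma, phi = sigma[-k:][::-1], phi[:, -k:][:, ::-1]  # top k, descending
    M = np.zeros((m, n * k + 2 * n + m))
    preds = []
    for t in range(T):
        # features: sigma_j^{1/4} * (convolution of past inputs with phi_j), then x_{t-1}, x_t, y_{t-1}
        past = xs[t - 1::-1] if t > 0 else np.zeros((0, n))      # x_{t-1}, x_{t-2}, ...
        conv = (sigma[:, None] ** 0.25) * (phi[: len(past)].T @ past) if t > 0 else np.zeros((k, n))
        x_prev = xs[t - 1] if t > 0 else np.zeros(n)
        y_prev = ys[t - 1] if t > 0 else np.zeros(m)
        X = np.concatenate([conv.ravel(), x_prev, xs[t], y_prev])
        yhat = M @ X
        preds.append(yhat)
        M = M + 2 * lr * np.outer(ys[t] - yhat, X)   # descent step on ||y_t - M X||^2
        nrm = np.linalg.norm(M)
        if nrm > R_M:
            M *= R_M / nrm                           # Frobenius-ball projection
    return np.array(preds)
```

On a stream where yt is a fixed linear function of the recent inputs, the squared error decays quickly; the filter features supply the longer memory that a short explicit window would truncate.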
In Section 4.2, we complete the regret analysis using Theorem 3, along with bounds on the diameter and gradient, to conclude Theorem 1.\nSince the batch analogue is less general (and uses the same ideas), we defer discussion of Algorithm 2 and Theorem 2 to Appendix A.\n\n4.1 Approximate convex relaxation via wave filters\n\nAssume for now that h0 = 0; we will remove this at the end, and see that the regret bound is asymptotically the same. Recall (from Section 2.2) that we measure regret compared to predictions obtained by adding the derivative of the impulse response function of an LDS Θ to yt−1. Our approximation theorem states that for any Θ, there is some MΘ ∈ Ĥ which produces approximately the same predictions. Formally:\nTheorem 3 (Spectral convex relaxation for symmetric LDSs). Let {ŷt}_{t=1}^T be the online predictions made by an LDS Θ = (A, B, C, D, h0 = 0). Let RΘ = max{‖B‖F, ‖C‖F, ‖D‖F}. Then, for any ε > 0, with a choice of k = Ω(log T log(RΘRxLynT/ε)), there exists an MΘ ∈ R^{m×k′} such that\n\nΣ_{t=1}^T ‖MΘ X̃t − yt‖^2 ≤ Σ_{t=1}^T ‖ŷt − yt‖^2 + ε.\n\nHere, k′ and X̃t are defined as in Algorithm 1 (noting that X̃t includes the previous ground truth yt−1).\n\n5Actually, for a slightly tighter proof, we analyze a restriction of the algorithm which does not learn the portion M(y), instead always choosing the identity matrix for that block.\n\nProof. We construct this mapping Θ ↦ MΘ explicitly. 
Write M\u0398 as the block matrix\n\n(cid:2)\n\nM (1) M (2)\n\n\u00b7\u00b7\u00b7 M (k) M (x(cid:48)) M (x) M (y)(cid:3) ,\n\nwhere the blocks\u2019 dimensions are chosen to align with \u02dcXt, the concatenated vector\n\n\u03c31/4\n1\n\n(X \u2217 \u03c61)t \u03c31/4\n\n2\n\n(X \u2217 \u03c62)t\n\n\u00b7\u00b7\u00b7\n\n\u03c31/4\nk\n\n(X \u2217 \u03c6k)t\n\nxt\u22121\n\nxt\n\nyt\u22121\n\nso that the prediction is the block matrix-vector product\n\n(cid:104)\n\n(cid:105)\n\n,\n\nM\u0398 \u02dcXt =\n\nj M (j)(X \u2217 \u03c6j)t + M (x(cid:48))xt\u22121 + M (x)xt + M (y)yt\u22121.\n\u03c31/4\n\nk(cid:88)\n\nj=1\n\nl=1.6 Let bl be the l-th row\nWithout loss of generality, assume that A is diagonal, with entries {\u03b1l}d\nof B, and cl the l-th column of C. Also, we de\ufb01ne a continuous family of vectors \u00b5 : [0, 1] \u2192 RT ,\nwith entries \u00b5(\u03b1)(i) = (\u03b1l \u2212 1)\u03b1i\u22121\n. Then, our construction is as follows:\n\nl\n\n\u2022 M (j) =(cid:80)d\n\u2022 M (x(cid:48)) = \u2212D, M (x) = CB + D, M (y) = Im\u00d7m.\n\n(cid:104)\u03c6j, \u00b5(\u03b1l)(cid:105) (cl \u2297 bl), for each 1 \u2264 j \u2264 k.\n\n\u22121/4\nl=1 \u03c3\nj\n\nBelow, we give the main ideas for why this M\u0398 works, leaving the full proof to Appendix C.\nSince M (y) is the identity, the online learner\u2019s task is to predict the differences yt \u2212 yt\u22121 as well as\nthe derivative \u0398, which we write here:\n\n\u02c6yt \u2212 yt\u22121 = (CB + D)xt \u2212 Dxt\u22121 +\n\n= (CB + D)xt \u2212 Dxt\u22121 +\n\n= (CB + D)xt \u2212 Dxt\u22121 +\n\nC(Ai \u2212 Ai\u22121)Bxt\u2212i\n(cid:32) d(cid:88)\n(cid:0)\u03b1i\nl \u2212 \u03b1i\u22121\nT\u22121(cid:88)\n\nl=1\n\nC\n\nl\n\n(cid:1) el \u2297 el\n\n(cl \u2297 bl)\n\ni=1\n\n\u00b5(\u03b1l)(i) xt\u2212i.\n\n(cid:33)\n\nBxt\u2212i\n\n(4)\n\nthe inner sum is an inner product between each coordinate of the past\n\nNotice that\ninputs\n(xt, xt\u22121, . . . 
, xt\u2212T ) with \u00b5(\u03b1l) (or a convolution, viewed across the entire time horizon). The crux\nj=1.\nof our proof is that one can approximate \u00b5(\u03b1) using a linear combination of the \ufb01lters {\u03c6j}k\nWriting Z := ZT for short, notice that\n\ni=1\n\nT\u22121(cid:88)\nT\u22121(cid:88)\nd(cid:88)\n\ni=1\n\nl=1\n\n(cid:90) 1\n\nZ =\n\n0\n\n\u00b5(\u03b1) \u2297 \u00b5(\u03b1) d\u03b1,\n\nsince the (i, j) entry of the RHS is\n\n(cid:90) 1\n\n0\n\n(\u03b1 \u2212 1)2\u03b1i+j\u22122 d\u03b1 =\n\n1\n\ni + j \u2212 1 \u2212\n\n2\n\ni + j\n\n+\n\n1\n\ni + j + 1\n\n= Zij.\n\nWhat follows is a spectral bound for reconstruction error, relying on the low approximate rank of Z:\nLemma 4.1. Choose any \u03b1 \u2208 [0, 1]. Let \u02dc\u00b5(\u03b1) be the projection of \u00b5(\u03b1) onto the k-dimensional\nsubspace of RT spanned by {\u03c6j}k\n\nj=1. Then,\n\n(cid:118)(cid:117)(cid:117)(cid:116)6\n\nT(cid:88)\n\n\u03c3j \u2264 O\n\nj=k+1\n\n(cid:16)\n\n(cid:112)\n\n(cid:17)\n\n\u2212k/ log T\nc\n0\n\nlog T\n\n,\n\n(cid:107)\u00b5(\u03b1) \u2212 \u02dc\u00b5(\u03b1)(cid:107)2 \u2264\n\nfor an absolute constant c0 > 3.4.\n\n6Write the eigendecomposition A = U \u039bU T . 
Then, the LDS with parameters ( \u02c6A, \u02c6B, \u02c6C, D, h0) :=\n\n(\u039b, BU, U T C, D, h0) makes the same predictions as the original, with \u02c6A diagonal.\n\n7\n\n\fBy construction of M (j), M\u0398 \u02dcXt replaces each \u00b5(\u03b1l) in Equation (4) with its approximation \u02dc\u00b5(\u03b1l).\nHence we conclude that\n\nM\u0398 \u02dcXt = yt\u22121 + (CB + D)xt \u2212 Dxt\u22121 +\n\n(cl \u2297 bl)\n\n\u02dc\u00b5(\u03b1l)(i) xt\u2212i\n\nd(cid:88)\n\nl=1\n\nT\u22121(cid:88)\n\ni=1\n\n= yt\u22121 + (\u02c6yt \u2212 yt\u22121) + \u03b6t = \u02c6yt + \u03b6t,\n\nletting {\u03b6t} denote some residual vectors arising from discarding the subspace of dimension T \u2212 k.\nTheorem 3 follows by showing that these residuals are small, using Lemma 4.1: it turns out that\n(cid:107)\u03b6t(cid:107) is exponentially small in k/ log T , which implies the theorem.\n4.2 From approximate relaxation to low regret\nLet \u0398\u2217\n\u2208 H denote the best LDS predictor, and let M\u0398\u2217 \u2208 \u02c6H be its image under the map\nfrom Theorem 3, so that total squared error of predictions M\u0398\u2217 \u02dcXt is within \u03b5 from that of\n\u0398\u2217. Notice that the loss functions ft(M )\n= (cid:107)yt \u2212 M \u02dcXt(cid:107)2 are quadratic in M, and thus con-\nvex. Algorithm 1 runs online gradient descent [Zin03] on these loss functions, with decision set\n(cid:107)F be the diameter\nM\nof M, and Gmax := supM\u2208M, \u02dcX(cid:107)\u2207ft(M )(cid:107)F be the largest norm of a gradient. We can invoke the\nclassic regret bound:\n\u221a\nLemma 4.2 (e.g. Thm. 3.1 in [Haz16]). Online gradient descent, using learning rate Dmax\n, has\nGmax\nregret\n\n= {M \u2208 Rm\u00d7k(cid:48) (cid:12)(cid:12) (cid:107)M(cid:107)F \u2264 RM}. 
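The reconstruction property behind Lemma 4.1 is easy to probe numerically: projecting μ(α), with entries μ(α)(i) = (α − 1)α^(i−1), onto the top-k eigenvectors of ZT leaves only a tiny residual. The check below is our own, with T and k chosen arbitrarily:

```python
import numpy as np

T, k = 200, 15
idx = np.arange(1, T + 1)
s = idx[:, None] + idx[None, :]
Z = 2.0 / (s**3 - s)                 # the Hankel matrix Z_T
_, phi = np.linalg.eigh(Z)
phi_k = phi[:, -k:]                  # top-k eigenvectors (eigh is ascending)

worst = 0.0
for alpha in np.linspace(0.0, 0.999, 200):
    mu = (alpha - 1.0) * alpha ** (idx - 1.0)   # mu(alpha) in R^T
    resid = mu - phi_k @ (phi_k.T @ mu)         # residual after projection
    worst = max(worst, float(np.linalg.norm(resid)))
```

Consistent with the lemma's bound via the tail eigenvalue sum of Z, the worst-case residual over the sampled α is several orders of magnitude below the norm of a typical μ(α).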
Let Dmax := supM,M(cid:48)\u2208M(cid:107)M \u2212 M(cid:48)\n\ndef\n\ndef\n\nT\n\nRegretOGD(T )\n\ndef\n=\n\nft(Mt) \u2212 min\nM\u2208M\n\nft(M ) \u2264 2GmaxDmax\u221aT .\n\nT(cid:88)\n\nt=1\n\nT(cid:88)\n\nt=1\n\nTo \ufb01nish, it remains to show that Dmax and Gmax are small.\nIn particular, since the gradients\ncontain convolutions of the input by (cid:96)2 (not (cid:96)1) unit vectors, special care must be taken to ensure\nthat these do not grow too quickly. These bounds are shown in Section D.2, giving the correct\nregret of Algorithm 1 in comparison with the comparator M\u2217\n\u2208 \u02c6H. By Theorem 3, M\u2217 competes\narbitrarily closely with the best LDS in hindsight, concluding the theorem.\nFinally, we discuss why it is possible to relax the earlier assumption h0 = 0 on the initial hidden\nstate. Intuitively, as more of the ground truth responses {yt} are revealed, the largest possible effect\nof the initial state decays. Concretely, in Section D.4, we prove that a comparator who chooses a\nnonzero h0 can only increase the regret by an additive \u02dcO(log2 T ) in the online setting.\n\n5 Experiments\n\nIn this section, to highlight the appeal of our provable method, we exhibit two minimalistic cases\nwhere traditional methods for system identi\ufb01cation fail, while ours successfully learns the system.\nFinally, we note empirically that our method seems not to degrade in practice on certain well-\nbehaved nonlinear systems. In each case, we use k = 25 \ufb01lters, and a regularized follow-the-leader\nvariant of Algorithm 1 (see Appendix B.2).\n\n5.1 Synthetic systems: two hard cases for EM and SSID\n\nWe construct two dif\ufb01cult systems, on which we run either EM or subspace identi\ufb01cation7 (SSID),\nfollowed by Kalman \ufb01ltering to obtain predictions. 
Note that our method runs significantly (>1000 times) faster than this traditional pipeline.\nIn the first example (Figure 2(a), left), we have a SISO system (n = m = 1) and d = 2; all xt, ξt, and ηt are i.i.d. Gaussians, and B⊤ = C = [1 1], D = 0. Most importantly, A = diag([0.999, 0.5]) is ill-conditioned, so that there are long-term dependences between input and output. Observe that although EM and SSID both find reasonable guesses for the system's dynamics, these turn out to be local optima. Our method learns to predict as well as the best possible LDS.\n\n7Specifically, we use “Deterministic Algorithm 1” from page 52 of [VODM12].\n\n(a) Two synthetic systems. For clarity, error plots are smoothed by a median filter. Left: Noisy SISO system with a high condition number; EM and SSID find a bad local optimum. Right: High-dimensional MIMO system; other methods fail to learn any reasonable model of the dynamics.\n\n(b) Forced pendulum, a physical simulation our method learns in practice, despite a lack of theory.\n\nFigure 2: Visualizations of Algorithm 1. All plots: blue = ours, yellow = EM, red = SSID, black = true responses, green = inputs, dotted lines = “guess the previous output” baseline. Horizontal axis is time.\n\nThe second example (Figure 2(a), right) is a MIMO system (with n = m = d = 10), also with Gaussian noise. The transition matrix A = diag([0, 0.1, 0.2, . . . , 0.9]) has a diverse spectrum, the observation matrix C has i.i.d. Gaussian entries, and B = In, D = 0. The inputs xt are random block impulses. 
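The first synthetic system above can be written down in a few lines, and its impulse response makes the difficulty concrete: the 0.999 mode still retains about 37% of its mass after 1000 steps, so any method that truncates history to a short window discards a constant fraction of the response. This is our own noiseless sketch, not the experiment code:

```python
import numpy as np

# System 1 from the experiments (noise omitted): d = 2, n = m = 1
A = np.diag([0.999, 0.5])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 1.0]])

def impulse(i):
    """Impulse-response coefficient C A^(i-1) B at lag i >= 1."""
    return float(C @ np.linalg.matrix_power(A, i - 1) @ B)

g1, g1000 = impulse(1), impulse(1000)   # 2.0 at lag 1; ~0.999^999 ~ 0.37 at lag 1000
```

This slow decay is exactly the failure mode of the truncated-regression warmup of Section 2.3, and the regime where spectral-gap-dependent guarantees degrade.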
This system identification problem is high-dimensional and non-convex; it is thus no surprise that EM and SSID consistently fail to converge.

5.2 The forced pendulum: a nonlinear, non-symmetric system

We remark that although our algorithm has provable regret guarantees only for LDSs with symmetric transition matrices, it appears in experiments to succeed in learning some non-symmetric (even nonlinear) systems in practice, much like the unscented Kalman filter [WVDM00]. In Figure 2(b), we provide a typical learning trajectory for a forced pendulum, under Gaussian noise and random block impulses. Physical systems like this are widely considered in control and robotics, suggesting possible real-world applicability for our method.

6 Conclusion

We have proposed a novel approach for provably and efficiently learning linear dynamical systems. Our online wave-filtering algorithm attains near-optimal regret in theory, and experimentally outperforms traditional system identification in both prediction quality and running time. Furthermore, we have introduced a "spectral filtering" technique for convex relaxation, which uses convolutions by eigenvectors of a Hankel matrix. We hope that this theoretical tool will be useful in tackling more general cases, as well as other non-convex learning problems.

Acknowledgments

We thank Holden Lee and Yi Zhang for helpful discussions. We are especially grateful to Holden for a thorough reading of our manuscript, and for pointing out a way to tighten the result in Lemma C.1.

References

[Aud14] Koenraad M. R. Audenaert.
A generalisation of Mirsky's singular value inequalities. arXiv preprint arXiv:1410.4941, 2014.

[BM02] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[BT16] Bernhard Beckermann and Alex Townsend. On the singular values of matrices with displacement structure. arXiv preprint arXiv:1609.09494, 2016.

[Cho83] Man-Duen Choi. Tricks or treats with the Hilbert matrix. The American Mathematical Monthly, 90(5):301–312, 1983.

[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[GH96] Zoubin Ghahramani and Geoffrey E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto, Department of Computer Science, 1996.

[Grü82] F. Alberto Grünbaum. A remark on Hilbert's matrix. Linear Algebra and its Applications, 43:119–124, 1982.

[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2016.

[Hil94] David Hilbert. Ein Beitrag zur Theorie des Legendre'schen Polynoms. Acta Mathematica, 18(1):155–159, 1894.

[HMR16] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.

[HSC+16] Wenbing Huang, Fuchun Sun, Lele Cao, Deli Zhao, Huaping Liu, and Mehrtash Harandi. Sparse coding and dictionary learning with linear dynamical systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3938–3947, 2016.

[Kal60] Rudolph Emil Kalman.
A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[KV05] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[Lju98] Lennart Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.

[Lju02] Lennart Ljung. Prediction error estimation methods. Circuits, Systems and Signal Processing, 21(1):11–21, 2002.

[LW94] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[Mar10a] James Martens. Learning the linear dynamical system with ASOS. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning, pages 743–750. Omnipress, 2010.

[Mar10b] James Martens. Learning the linear dynamical system with ASOS. In Proceedings of the 27th International Conference on Machine Learning, pages 743–750, 2010.

[RG99] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

[Sch11] J. Schur. Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik, 140:1–28, 1911.

[Sle78] David Slepian. Prolate spheroidal wave functions, Fourier analysis, and uncertainty: The discrete case. Bell Labs Technical Journal, 57(5):1371–1430, 1978.

[SS82] Robert H. Shumway and David S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253–264, 1982.

[VODM12] Peter Van Overschee and B. L. De Moor.
Subspace Identification for Linear Systems. Springer Science & Business Media, 2012.

[WVDM00] Eric A. Wan and Rudolph Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000 (AS-SPCC), pages 153–158. IEEE, 2000.

[Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.