{"title": "Multi-view Matrix Factorization for Linear Dynamical System Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 7092, "page_last": 7101, "abstract": "We consider maximum likelihood estimation of linear dynamical systems with generalized-linear observation models. Maximum likelihood is typically considered to be hard in this setting since latent states and transition parameters must be inferred jointly. Given that expectation-maximization does not scale and is prone to local minima, moment-matching approaches from the subspace identification literature have become standard, despite known statistical efficiency issues. In this paper, we instead reconsider likelihood maximization and develop an optimization based strategy for recovering the latent states and transition parameters. Key to the approach is a two-view reformulation of maximum likelihood estimation for linear dynamical systems that enables the use of global optimization algorithms for matrix factorization. We show that the proposed estimation strategy outperforms widely-used identification algorithms such as subspace identification methods, both in terms of accuracy and runtime.", "full_text": "Multi-view Matrix Factorization for Linear\n\nDynamical System Estimation\n\nMahdi Karami, Martha White, Dale Schuurmans, Csaba Szepesv\u00e1ri\n\n{karami1, whitem, daes, szepesva}@ualberta.ca\n\nDepartment of Computer Science\n\nUniversity of Alberta\nEdmonton, AB, Canada\n\nAbstract\n\nWe consider maximum likelihood estimation of linear dynamical systems with\ngeneralized-linear observation models. Maximum likelihood is typically considered\nto be hard in this setting since latent states and transition parameters must be\ninferred jointly. Given that expectation-maximization does not scale and is prone\nto local minima, moment-matching approaches from the subspace identi\ufb01cation\nliterature have become standard, despite known statistical ef\ufb01ciency issues. 
In this\npaper, we instead reconsider likelihood maximization and develop an optimization\nbased strategy for recovering the latent states and transition parameters. Key to\nthe approach is a two-view reformulation of maximum likelihood estimation for\nlinear dynamical systems that enables the use of global optimization algorithms for\nmatrix factorization. We show that the proposed estimation strategy outperforms\nwidely-used identi\ufb01cation algorithms such as subspace identi\ufb01cation methods, both\nin terms of accuracy and runtime.\n\n1\n\nIntroduction\n\nLinear dynamical systems (LDS) provide a fundamental model for estimation and forecasting in\ndiscrete-time multi-variate time series. In an LDS, each observation is associated with a latent state;\nthese unobserved states evolve as a Gauss-Markov process where each state is a linear function of the\nprevious state plus noise. Such a model of a partially observed dynamical system has been widely\nadopted, particularly due to its ef\ufb01ciency for prediction of future observations using Kalman \ufb01ltering.\nEstimating the parameters of an LDS\u2014sometimes referred to as system identi\ufb01cation\u2014is a dif\ufb01cult\nproblem, particularly if the goal is to obtain the maximum likelihood estimate of parameters. Con-\nsequently, spectral methods from the subspace identi\ufb01cation literature, based on moment-matching\nrather than maximum likelihood, have become popular. These methods provide closed form solutions,\noften involving a singular value decomposition of a matrix constructed from the empirical moments\nof observations (Moonen and Ramos, 1993; Van Overschee and De Moor, 1994; Viberg, 1995;\nKatayama, 2006; Song et al., 2010; Boots and Gordon, 2012). 
The most widely used such algorithms\nfor parameter estimation in LDSs are the family of N4SID algorithms (Van Overschee and De Moor,\n1994), which are computationally ef\ufb01cient and asymptotically consistent (Andersson, 2009; Hsu\net al., 2012). Recent evidence, however, suggests that these moment-matching approaches may suffer\nfrom weak statistical ef\ufb01ciency, performing particularly poorly with small sample sizes (Foster et al.,\n2012; Zhao and Poupart, 2014).\nMaximum likelihood for LDS estimation, on the other hand, has several advantages. For example, it\nis asymptotically ef\ufb01cient under general conditions (Cram\u00e9r, 1946, Ch.33), and this property often\ntranslates to near-minimax \ufb01nite-sample performance. Further, maximum likelihood is amenable\nto coping with missing data. Another bene\ufb01t is that, since the likelihood for exponential families\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fand corresponding convex losses (Bregman divergences) are well understood (Banerjee et al., 2005),\nmaximum likelihood approaches can generalize to a broad range of distributions over the observations.\nSimilarly, other common machine learning techniques, such as regularization, can be naturally\nincorporated in a maximum likelihood framework, interpretable as maximum a posteriori estimation.\nUnfortunately, unlike spectral methods, there is no known ef\ufb01cient algorithm for recovering pa-\nrameters that maximize the marginal likelihood of observed data in an LDS. Standard iterative\napproaches are based on EM (Ghahramani and Hinton, 1996; Roweis and Ghahramani, 1999), which\nare computationally expensive and have been observed to produce locally optimal solutions that yield\npoor results (Katayama, 2006). 
A classical system identification method, called the prediction error method (PEM), is based on minimization of prediction error and can be interpreted as maximum likelihood estimation under certain distributional assumptions (e.g., Ch. 7.4 of Ljung 1999, Åström 1980). PEM, however, is prone to local minima and requires selection of a canonical parameterization, which can be difficult in practice and can result in ill-conditioned problems (Katayama, 2006).
In this paper, we propose an alternative approach to LDS parameter estimation under exponential family observation noise. In particular, we reformulate the LDS as a two-view generative model, which allows us to approximate the estimation task as a form of matrix factorization and apply recent global optimization techniques for such models (Zhang et al., 2012; Yu et al., 2014). To extend these previous algorithms to this setting, we provide a novel proximal update for the two-view approach that significantly simplifies the algorithm. Finally, for forecasting on synthetic and real data, we demonstrate that the proposed algorithm matches or outperforms N4SID, while scaling better with increasing sample size and data dimension.

2 Linear dynamical systems

We address discrete-time, time-invariant linear dynamical systems, specified as

φ_{t+1} = A φ_t + η_t
x_t = C φ_t + ε_t    (1)

where φ_t ∈ R^k is the hidden state at time t; x_t ∈ R^d is the observation vector at time t; A ∈ R^{k×k} is the dynamics matrix; C ∈ R^{d×k} is the observation matrix; η is the state evolution noise; and ε is the observation noise. The noise terms are assumed to be independent. As is common, we assume that the state evolution noise is Gaussian: η ∼ N(0, Σ_η). We additionally allow for general observation noise to be generated from an exponential family distribution (e.g., Poisson).
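As a minimal illustration (ours, not part of the paper's method), model (1) can be simulated directly; the dimensions, the noise scales, and the spectral-radius rescaling of A below are arbitrary choices:

```python
import numpy as np

def simulate_lds(A, C, T, state_noise=0.1, obs_noise=0.1, rng=None):
    """Sample a length-T trajectory from model (1):
    phi_{t+1} = A phi_t + eta_t,  x_t = C phi_t + eps_t, with Gaussian noise."""
    rng = np.random.default_rng(rng)
    k, d = A.shape[0], C.shape[0]
    phi = np.zeros(k)
    states, obs = [], []
    for _ in range(T):
        states.append(phi)
        obs.append(C @ phi + obs_noise * rng.standard_normal(d))
        phi = A @ phi + state_noise * rng.standard_normal(k)
    return np.array(states), np.array(obs)

rng = np.random.default_rng(0)
k, d = 3, 5
A = rng.standard_normal((k, k))
A *= 0.97 / max(abs(np.linalg.eigvals(A)))   # rescale so |lambda_i(A)| <= 0.97 (stable)
C = rng.standard_normal((d, k))
Phi, X = simulate_lds(A, C, T=200, rng=1)
```

Rescaling by the spectral radius is one simple way to obtain the stable dynamics matrices the experiments in Section 5 assume.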
The graphical representation for this LDS is shown in Figure 1.
An LDS encodes the intuition that a latent state is driving the dynamics, which can significantly simplify estimation and forecasting. The observations typically contain only partial information about the environment (such as in the form of limited sensors), and may further contain noisy or even irrelevant components. Learning transition models for such observations can be complex, particularly if the observations are high-dimensional. For example, in spatiotemporal processes, the data is typically extremely high-dimensional, composed of structured grid data; however, it is possible to extract a low-rank state-space that significantly simplifies analysis (Gelfand et al., 2010, Chapter 8). Further, for forecasting, iterating transitions for such a low-rank state-space can provide longer range predictions with less error accumulation than iterating with the observations themselves.
The estimation problem for an LDS involves extracting the unknown parameters, given a time series of observations x_1, ..., x_T. Unfortunately, jointly estimating the parameters A, C and φ_t is difficult because the multiplication of these variables typically results in a nonconvex optimization. Given the latent states φ_t, estimation of A and C is more straightforward, though there are still some issues with maintaining stability (Siddiqi et al., 2007). There are some recent advances improving estimation in time series models using matrix factorization. White et al. (2015) provide a convex formulation for auto-regressive moving average models; although related to state-space models, these do not permit a straightforward conversion between the parameters of one and the other. Yu et al.
(2015) factorize the observation into a hidden state and dictionary, using a temporal regularizer on the extracted hidden state; the resulting algorithm, however, is not guaranteed to provide an optimal solution.

[Figure 1 diagram: latent states φ_1, φ_2, φ_3 linked by A, emitting observations x_1, x_2, x_3 through C, with second-view loadings E.]

Figure 1: Graphical representation for the standard LDS formulation and the corresponding two-view model. The two-view formulation is obtained by a linear transformation of the LDS model. The LDS model includes only parameters C and A, and the two-view model includes parameters C and E = CA, where A can be extracted from E after C and E are estimated.

3 Two-view Formulation of LDS

In this section, we reformulate the LDS as a generative two-view model with a shared latent factor. In the following section, we demonstrate how to estimate the parameters of this reformulation optimally, from which parameter estimates of the original LDS can be recovered.
To obtain a two-view formulation, we re-express the two equations for the LDS as two equations for pairs of sequential observations. To do so, we multiply the state evolution equation in (1) by C and add ε_{t+1} to obtain C φ_{t+1} + ε_{t+1} = CA φ_t + C η_t + ε_{t+1}; representing the LDS model as

x_{t+1} = E φ_t + ε′_{t+1}
x_t = C φ_t + ε_t    (2)

where we refer to E := CA as the factor loading matrix and ε′_{t+1} := C η_t + ε_{t+1} as the noise of the second view. We then have a two-view problem where we need to estimate parameters E and C. Since the noise components ε_t and ε′_t are independent, the two views x_t and x_{t+1} are conditionally independent given the shared latent state φ_t. The maximum log likelihood problem for the two-view formulation then becomes

max_{C,E,Φ} log p(x_1, ..., x_T | φ_0, φ_1, ..., φ_T, C, E) = max_{C,E,Φ} Σ_{t=1}^T log p(x_t | φ_{t−1}, φ_t, C, E)    (3)

where, given the hidden states, the observations are conditionally independent.
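As a small sketch (variable names are ours, not the paper's), the two views are simply adjacent slices of the observation matrix, and A is recoverable from C and E = CA by a pseudoinverse when C has full column rank:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, T = 3, 6, 50
A = 0.9 * np.eye(k)                      # example dynamics matrix
C = rng.standard_normal((d, k))          # observation matrix (full column rank a.s.)
E = C @ A                                # second-view factor loading E = CA

X = rng.standard_normal((d, T))          # stand-in observation sequence, columns x_1..x_T
Z1, Z2 = X[:, :T-1], X[:, 1:]            # view 1: x_{1:T-1}, view 2: x_{2:T}

A_hat = np.linalg.pinv(C) @ E            # A = C^+ E, as used after estimation
```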
The log-likelihood (3) is equivalent to that of the original LDS, but is expressed in terms of the distribution p(x_t | φ_{t−1}, φ_t, C, E), where the probability of an observation increases if it has high probability under both φ_{t−1} and φ_t. The graphical depiction of the LDS and its implied two-view model is illustrated in Figure 1.

3.1 Relaxation

To tackle the estimation problem, we reformulate the estimation problem for this equivalent two-view model of the LDS. Note that according to the two-view model (2), the conditional distribution (3) can be expressed as p(x_t | φ_{t−1}, φ_t, C, E) = p(x_t | E φ_{t−1}) = p(x_t | C φ_t). Substituting either of these in the summation (3) would result in a factor loading model that ignores the temporal correlation among data; therefore, to take the system dynamics into account we choose a balanced averaging of both, as log p(x_t | φ_{t−1}, φ_t, C, E) = ½ log p(x_t | E φ_{t−1}) + ½ log p(x_t | C φ_t), where the likelihood of an observation increases if it has high conditional likelihood given both φ_{t−1} and φ_t.¹ With this choice, and the exponential family specified by the log-normalizer (also called potential function) F : R^d → R, with the corresponding Bregman divergence defined as D_F(ẑ‖z) := F(ẑ) − F(z) − f(z)ᵀ(ẑ − z) using transfer function f = ∇F,² the log-likelihood separates into the two components

argmax_{C,E,Φ} Σ_{t=1}^T log p(x_t | φ_{t−1}, φ_t, C, E) = argmax_{C,E,Φ} ½ Σ_{t=1}^T [ log p(x_t | E φ_{t−1}) + log p(x_t | C φ_t) ]
= argmin_{C,E,Φ} ½ Σ_{t=1}^T [ D_F(E φ_{t−1} ‖ f⁻¹(x_t)) + D_F(C φ_t ‖ f⁻¹(x_t)) ]

¹ The balanced averaging can be generalized to a convex combination of the log-likelihoods, which adds a flexibility to the problem that can be tuned to improve performance. However, we found that the simple balanced combination renders the best experimental performance in most cases.

² Consult Banerjee et al.
(2005) for a complete overview of this correspondence.

Each Bregman divergence term can be interpreted as the fitness measure for each view. For example, a Gaussian distribution can be expressed by an exponential family defined by F(z) = ½‖z‖²₂. The above derivation could be extended to different variance terms for ε and ε′, which would result in different weights on the two Bregman divergences above. Further, we could also allow different exponential families (hence different Bregman divergences) for the two distributions; however, there is no clear reason why this would be beneficial over simply selecting the same exponential family, since both describe x_t. In this work, therefore, we will explore a balanced loss, with the same exponential family for each view.
In order to obtain a low-rank solution, one can relax the hard rank constraint and employ the block norm ‖Φ‖_{2,1} = Σ_{j=1}^k ‖Φ_{j:}‖₂ as the rank-reducing regularizer on the latent state.³ This regularizer offers an adaptive rank-reducing scheme that zeros out many of the rows of the latent states and hence results in a low-rank solution without knowing the rank a priori. For the reconstruction models C and E, we need to specify a prior that respects the conditional independence of the views x_t and x_{t+1} given φ_t. This goal can be achieved if C and E are constrained individually so that they do not compete against each other to reconstruct their respective views (White et al., 2012). Incorporating the regularizer and constraints, the resulting optimization problem has the form

argmin_{C,E,Φ} Σ_{t=1}^T L₁(E φ_{t−1}; x_t) + L₂(C φ_t; x_t) + λ Σ_{j=1}^k ‖Φ_{j:}‖₂
s.t. ‖C_{:j}‖₂ ≤ γ₁, ‖E_{:j}‖₂ ≤ γ₂  ∀ j ∈ (1, k).    (4)

The above constrained optimization problem is convex in each of the factor loading matrices {C, E} and the state matrix Φ, but not jointly convex in terms of all these variables.
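To make the Bregman-divergence view concrete, here is a small numerical check (ours, not from the paper) that the divergence generated by F(z) = ½‖z‖² is exactly half the squared Euclidean distance, which is what turns the Gaussian log-likelihood into the ℓ₂ losses used later in (5):

```python
import numpy as np

def bregman(F, grad_F, z_hat, z):
    """D_F(z_hat || z) = F(z_hat) - F(z) - <grad F(z), z_hat - z>."""
    return F(z_hat) - F(z) - grad_F(z) @ (z_hat - z)

F = lambda z: 0.5 * z @ z          # Gaussian log-normalizer
grad_F = lambda z: z               # transfer function f = grad F (identity)

z_hat = np.array([1.0, 2.0, -1.0])
z = np.array([0.5, 0.0, 1.0])
d = bregman(F, grad_F, z_hat, z)   # equals 0.5 * ||z_hat - z||^2 for this F
```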
Nevertheless, the following lemma shows that (4) admits a convex reformulation by a change of variable.

Lemma 1 Let Ẑ⁽¹⁾ := CΦ and Ẑ⁽²⁾ := EΦ, with their concatenated matrix Ẑ := [Ẑ⁽¹⁾; Ẑ⁽²⁾], and let Z⁽¹⁾ := [x_{1:T−1}], Z⁽²⁾ := [x_{2:T}]. In addition, define I⁽¹⁾ := diag([1; 0]) and I⁽²⁾ := diag([0; 1]); then the multi-view optimization problem (4) can be reformulated in the following convex form

min_{‖C_{:j}‖₂ ≤ γ₁, ‖E_{:j}‖₂ ≤ γ₂} min_{Φ : [C; E]Φ = Ẑ} L₁(CΦ; Z⁽¹⁾) + L₂(EΦ; Z⁽²⁾) + λ‖Φ‖_{2,1}
= min_{Ẑ} L₁(Ẑ⁽¹⁾; Z⁽¹⁾) + L₂(Ẑ⁽²⁾; Z⁽²⁾) + λ max_{0≤η≤1} ‖U_η⁻¹ Ẑ‖_tr

where U_η = (γ₁/√η) I⁽¹⁾ + (γ₂/√(1−η)) I⁽²⁾ and L_i(Y; Ŷ) = Σ_{t=1}^T L_i(y_t; ŷ_t). Moreover, we can show that the regularizer term ‖U_η⁻¹ Ẑ‖_tr is concave in η. The trace norm induces a low-rank result.
Proof: The proof can be readily derived from the results of White et al. (2012). □

In the next section, we demonstrate how to obtain globally optimal estimates of E, C and Φ.
Remark 1: This maximum likelihood formulation demonstrates how the distributional assumptions on the observations x_t can be generalized to any exponential family. Once expressed as the above optimization problem, one can further consider other losses and regularizers that may not immediately have a distributional interpretation, but result in improved prediction performance. This generalized formulation of maximum likelihood for LDS, therefore, has the additional benefit that it can flexibly incorporate optimization improvements, such as robust losses.⁴ Also, a regularizer can be designed to control overfitting to noisy observations, which is an issue in LDS estimation that can result in an unstable latent dynamics estimate (Buesing et al., 2012a).
Therefore, by controlling undesired overfitting to noisy samples, one can also prevent unintended unstable model identification.

³ Throughout this paper, X_{i:} (X_{:i}) is used to denote the ith row (ith column) of matrix X, and [X; Y] ([x; y]) denotes the matrix (vector) concatenation operator, which is equal to [Xᵀ, Yᵀ]ᵀ ([xᵀ, yᵀ]ᵀ).

⁴ Thus, we use L₁ and L₂ in (4) to refer generally to any loss function that is convex in its first argument.

Remark 2: We can generalize the optimization further to learn an LDS with exogenous input: a control vector u_t ∈ R^d that impacts both the hidden state and observations. This entails adding some new variables to the general LDS model, which can be expressed as

φ_{t+1} = A φ_t + B u_t + η_t
x_t = C φ_t + D u_t + ε_t

with additional matrices B ∈ R^{k×d} and D ∈ R^{d×d}. Again, by multiplying the state evolution equation by the matrix C, the resulting equations are

x_{t+1} = E φ_t + F u_t + D u_{t+1} + ε′_{t+1}
x_t = C φ_t + D u_t + ε_t

where F := CB. Therefore, the loss can be generally expressed as

Σ_{t=1}^T L₁(E φ_{t−1} + F u_{t−1} + D u_t; x_t) + L₂(C φ_t + D u_t; x_t).

The optimization would now be over the variables C, E, Φ, D, F, where the optimization could additionally include regularizers on D and F to control overfitting. Importantly, the addition of these variables D, F does not modify the convexity properties of the loss, and the treatment for estimating E, C and Φ in Section 4 directly applies. The optimization problem is jointly convex in D, F and any one of E, C or Φ, and jointly convex in D and F. Therefore, an outer minimization over D and F can be added to Algorithm 1 and we will still obtain a globally optimal solution.

4 LDS Estimation Algorithm

To learn the optimal parameters for the reformulated two-view model, we adopt the generalized conditional gradient (GCG) algorithm developed by Yu et al. (2014).
GCG is designed for optimization problems of the form l(x) + f(x), where l(x) is convex and continuously differentiable with Lipschitz continuous gradient and f(x) is a (possibly non-differentiable) convex function. The algorithm is computationally efficient, while providing a reasonably fast O(1/t) rate of convergence to the global minimizer. Though we have a nonconvex optimization problem, we can use the convex reformulation for two-view low-rank matrix factorization and the resulting algorithm in (Yu et al., 2014, Section 4). This algorithm includes a generic local improvement step, which significantly accelerates the convergence of the algorithm to a global optimum in practice. We provide a novel local improvement update, which both speeds learning and enforces a sparser structure on Φ, while maintaining the same theoretical convergence properties of GCG.
In our experiments, we specifically address the setting where the observations are assumed to be Gaussian, giving an ℓ₂ loss. We also prefer the unconstrained objective function, which can be efficiently minimized by fast unconstrained optimization algorithms. Therefore, using the well-established equivalent form of the regularizer (Bach et al., 2008), the objective (4) can be equivalently cast for the Gaussian distributed time series x_t as

min_{C,E,Φ} Σ_{t=1}^T ‖E φ_{t−1} − x_t‖²₂ + ‖C φ_t − x_t‖²₂ + λ Σ_{j=1}^k ‖Φ_{j:}‖₂ max(‖C_{:j}‖₂/γ₁, ‖E_{:j}‖₂/γ₂).    (5)

This product form of the regularizer is preferred over the square form used in (Yu et al., 2014), since it induces row-wise sparsity on Φ. Though the square form ‖Φ‖²_F admits efficient optimizers due to its smoothness, it does not prefer to zero out rows of Φ, while with the regularizer of the form (5), the learned hidden state will be appropriately projected down to a lower-dimensional space where many dimensions can be dropped from Φ, C and E, giving a low-rank solution.
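As an illustrative sketch of the Gaussian objective (5) (our code and naming, with the constraint scales γ₁ = γ₂ = 1 as an assumption), the two squared-error view losses and the product regularizer can be evaluated as:

```python
import numpy as np

def objective(C, E, Phi, X, lam=0.1, g1=1.0, g2=1.0):
    """Evaluate (5). Phi has columns phi_0 ... phi_T (shape k x (T+1));
    X has columns x_1 ... x_T (shape d x T)."""
    T = X.shape[1]
    loss = 0.0
    for t in range(T):
        loss += np.sum((E @ Phi[:, t] - X[:, t]) ** 2)       # ||E phi_{t-1} - x_t||^2
        loss += np.sum((C @ Phi[:, t + 1] - X[:, t]) ** 2)   # ||C phi_t - x_t||^2
    row_norms = np.linalg.norm(Phi, axis=1)                  # ||Phi_{j:}||_2
    col_scale = np.maximum(np.linalg.norm(C, axis=0) / g1,   # max(||C_:j||/g1,
                           np.linalg.norm(E, axis=0) / g2)   #     ||E_:j||/g2)
    return loss + lam * np.sum(row_norms * col_scale)

C = np.eye(2); E = np.eye(2)
Phi = np.zeros((2, 2))                  # columns phi_0, phi_1
X = np.array([[1.0], [0.0]])            # single observation x_1
val = objective(C, E, Phi, X)
```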
In practice, we found that enforcing this sparsity property on Φ significantly improved stability.⁵ Consequently, we need optimization routines that are appropriate for the non-smooth regularizer terms.
The local improvement step involves alternating block coordinate descent between C, E and Φ, with an accelerated proximal gradient algorithm (FISTA) (Beck and Teboulle, 2009) for each descent step. To use the FISTA algorithm we need to provide a proximal operator for the non-smooth regularizer in (5).

⁵ This was likely due to a reduction in the size of the transition parameters, resulting in improved re-estimation of A and a corresponding reduction in error accumulation when using the model for forecasting.

Algorithm 1 LDS-DV
  Input: training sequence {x_t, t ∈ [1, T]}
  Output: C, A, φ_t, Σ_η, Σ_ε
  Initialize C₀, E₀, Φ₀
  U₁ ← [C₀ᵀ; E₀ᵀ]ᵀ, V₁ ← Φ₀ᵀ
  for i = 1, ... do
    (u_i, v_i) ← argmin_{uvᵀ ∈ A} ⟨∇ℓ(U_i V_iᵀ), uvᵀ⟩   // compute polar
    (η_i, θ_i) ← argmin_{0≤η≤1, θ≥0} ℓ((1−η) U_i V_iᵀ + θ u_i v_iᵀ) + λ((1−η) ρ_i + θ)   // partially corrective update (PCU)
    U_init ← [√(1−η_i) U_i, √θ_i u_i], V_init ← [√(1−η_i) V_i, √θ_i v_i]
    (U_{i+1}, V_{i+1}) ← FISTA(U_init, V_init)
    ρ_i = ½ Σ_{j=1}^{i+1} (‖(U_{i+1})_{:j}‖²_{2v} + ‖(V_{i+1})_{:j}‖²₂)
  end for
  (C; E) ← U_{i+1}, Φ ← V_{i+1}ᵀ
  A ← Φ_{2:T} Φ†_{1:T−1}
  estimate Σ_η, Σ_ε by sample covariances

Let the proximal operator of a convex and possibly non-differentiable function f(y) be defined as

prox_f(x) = argmin_y f(y) + ½‖x − y‖²₂.

FISTA is an accelerated version of ISTA (Iterative Shrinkage-Thresholding Algorithm) that iteratively performs a gradient descent update with the smooth component of the objective, and then applies the proximal operator as a projection step.
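As a sketch (our implementation, with our own function names), the proximal operator of the block-max norm λ max(‖v₁‖₂, ‖v₂‖₂) appearing in this regularizer, given in closed form in Theorem 1 below, shrinks the larger block by up to λ and, when the two block norms are within λ of each other, equalizes them:

```python
import numpy as np

def prox_two_view_max(v1, v2, lam):
    """Proximal operator of f(v) = lam * max(||v1||_2, ||v2||_2)."""
    a, b = np.linalg.norm(v1), np.linalg.norm(v2)
    alpha = max(0.5 * (a - b + lam), 0.0)   # shrinkage applied to the v1 block
    beta = max(0.5 * (b - a + lam), 0.0)    # shrinkage applied to the v2 block
    if a <= b:
        s1, s2 = alpha, lam - alpha
    else:
        s1, s2 = lam - beta, beta
    scale = lambda v, n, s: v * max(1.0 - s / n, 0.0) if n > 0 else v
    return scale(v1, a, s1), scale(v2, b, s2)

# norms differ by more than lam: only the larger block shrinks, by exactly lam
p1, p2 = prox_two_view_max(np.array([3.0, 0.0]), np.array([0.0, 10.0]), lam=2.0)
# norms within lam of each other: both blocks shrink to a common norm
q1, q2 = prox_two_view_max(np.array([1.0, 0.0]), np.array([2.0, 0.0]), lam=2.0)
```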
Each iteration updates the variable x as x_{k+1} = prox_{κf}(x_k − κ ∇l(x_k)), which converges to a fixed point. If there is no known form for the proximal operator, as is the case for our non-differentiable regularizer, a common strategy is to numerically calculate the proximal update. This approach, however, can be prohibitively expensive, and an analytic (closed) form is clearly preferable. We derive such a closed form for (5) in Theorem 1.

Theorem 1 For a vector v = [v₁; v₂] composed of two subvectors v₁, v₂, define f(v) = ‖v‖_{2v} := λ max(‖v₁‖₂, ‖v₂‖₂). The proximal operator for this function is

prox_f(v) = [v₁ max{1 − α/‖v₁‖, 0}; v₂ max{1 − (λ−α)/‖v₂‖, 0}]   if ‖v₁‖ ≤ ‖v₂‖
prox_f(v) = [v₁ max{1 − (λ−β)/‖v₁‖, 0}; v₂ max{1 − β/‖v₂‖, 0}]   if ‖v₂‖ ≤ ‖v₁‖

where α := max{½(‖v₁‖ − ‖v₂‖ + λ), 0} and β := max{½(‖v₂‖ − ‖v₁‖ + λ), 0}.
Proof: See Appendix A. □

This result can be further generalized to enable additional regularization components on C and E, such as including an ℓ₁ norm on each column to further enforce sparsity (as in the elastic net). There is no closed form for the proximal operator of the sum of two functions in general. We prove, however, that for the special case of a linear combination of the two-view norm with any norms on the columns of C and E, the proximal mapping reduces to a simple composition rule.

Theorem 2 For norms R₁(v₁) and R₂(v₂), the proximal operator of the linear combination R_c(v) = ‖v‖_{2v} + ν₁ R₁(v₁) + ν₂ R₂(v₂) for ν₁, ν₂ ≥ 0 admits the simple composition

prox_{R_c}(v) = prox_{‖·‖_{2v}}([prox_{ν₁R₁}(v₁); prox_{ν₂R₂}(v₂)]).

Proof: See Appendix A. □

4.1 Recovery of the LDS model parameters

The above reformulation provides a tractable learning approach to obtain the optimal parameters for the two-view reformulation of the LDS; given this optimal solution, we can then estimate the parameters of the original LDS.
The first step is to estimate the transition matrix A. A natural approach is to use (2), and set Â = Ĉ†Ê for pseudoinverse Ĉ†. This Â, however, might be sensitive to inaccurate estimation of the (effective) hidden state dimension k. We found in practice that deviations from the optimal choice of k might result in unstable solutions and produce unreliable forecasts. Instead, a more stable Â can be learned from the hidden states themselves. This approach also focuses estimation of A on the forecasting task, which is our ultimate aim.
Given the sequence of hidden states φ₁, ..., φ_T, there are several strategies that could be used to estimate A, ranging from simple autoregressive models to more sophisticated strategies (Siddiqi et al., 2007). We opt for a simple linear regression solution Â = argmin_A Σ_{t=1}^{T−1} ‖φ_{t+1} − A φ_t‖²₂, which we found produced stable Â.
To estimate the noise parameters Σ_η, Σ_ε, recall η_t = φ_{t+1} − Â φ_t and ε_t = x_t − C φ_t. Having obtained Â, therefore, we can estimate the noise covariance matrices by computing their sample covariances as Σ̂_η = (1/(T−1)) Σ_{t=1}^{T−1} η_t η_tᵀ and Σ̂_ε = (1/T) Σ_{t=1}^{T} ε_t ε_tᵀ. The final LDS learning procedure is outlined in Algorithm 1. For more details about the polar computation and the partially corrective subroutine see (Yu et al., 2014, Section 4).

5 Experimental results

We evaluate the proposed algorithm by comparing one-step prediction performance and computation speed with alternative methods on real and synthetic time series.
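The parameter-recovery steps of Section 4.1 (least-squares Â from consecutive states, then residual covariances) can be sketched as follows; the code and its names are ours, not the paper's:

```python
import numpy as np

def recover_lds_params(Phi, X, C):
    """Given states phi_1..phi_T (columns of Phi), observations x_1..x_T
    (columns of X) and observation matrix C, estimate A by least squares on
    consecutive states and the noise covariances from the residuals."""
    k, T = Phi.shape
    Phi1, Phi2 = Phi[:, :-1], Phi[:, 1:]
    A_hat = Phi2 @ np.linalg.pinv(Phi1)   # argmin_A sum_t ||phi_{t+1} - A phi_t||^2
    eta = Phi2 - A_hat @ Phi1             # state-noise residuals
    eps = X - C @ Phi                     # observation-noise residuals
    Sigma_eta = eta @ eta.T / (T - 1)
    Sigma_eps = eps @ eps.T / T
    return A_hat, Sigma_eta, Sigma_eps

# noise-free sanity check: states generated by a known A are recovered exactly
A_true = np.array([[0.9, 0.0], [0.0, 0.5]])
Phi = np.zeros((2, 5)); Phi[:, 0] = [1.0, 1.0]
for t in range(4):
    Phi[:, t + 1] = A_true @ Phi[:, t]
C = np.eye(2)
A_hat, Sigma_eta, Sigma_eps = recover_lds_params(Phi, C @ Phi, C)
```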
We report the normalized mean square error (NMSE), defined as NMSE = Σ_{t=1}^{T_test} ‖y_t − ŷ_t‖² / Σ_{t=1}^{T_test} ‖y_t − μ_y‖², where μ_y = (1/T_test) Σ_{t=1}^{T_test} y_t.
Algorithms: We compared the proposed algorithm to a well-established method-of-moments-based algorithm, N4SID (Van Overschee and De Moor, 1994), Hilbert space embeddings of hidden Markov models (HSE-HMM) (Song et al., 2010), expectation-maximization for estimating the parameters of a Kalman filter (EM) (Roweis and Ghahramani, 1999), and PEM (Ljung, 1999). These are standard baseline algorithms that are used regularly for LDS identification. The parameters estimated by N4SID were used as the initialization point for the EM and PEM algorithms in our experiments. We used the built-in Matlab functions, n4sid and pem, with the order selected by the function, for the subspace identification method and PEM, respectively. For our algorithm, we select the regularization parameter using cross-validation: the training data is split by performing the learning on the first 80% of the training data and evaluating the prediction performance on the remaining 20%.
Real datasets: For experiments on real datasets we selected climate time series from the IRI data library, which records the surface temperature on a monthly basis for the tropical Atlantic ocean (ATL) and the tropical Pacific ocean (CAC). In CAC we selected the first 30 × 30 grid out of the total 84 × 30 locations, with 399 monthly samples, while in ATL the first 9 × 9 grid out of the total 38 × 25 locations is selected, each with a time series of length 564. We partitioned each area into smaller areas of size 3 × 3 and arranged them into vectors of size 9; the seasonality component of the time series was then removed and the data centered to have zero mean. We ran two experiments for each dataset. For the first, the whole sequence is sliced into 70% training and 30% test.
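The NMSE metric above can be computed as in the following sketch (ours); it normalizes the prediction error by the error of predicting the mean of the test targets:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized MSE: prediction error relative to predicting the test mean."""
    mu = y_true.mean(axis=0)
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - mu) ** 2)
    return num / den

y = np.array([[0.0], [2.0]])
# predicting the mean of the test targets gives NMSE exactly 1
baseline = nmse(y, np.array([[1.0], [1.0]]))
```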
For the second, a short training set of 70 samples is selected, with a test sequence of size 50.
Synthetic datasets: In the synthetic experiments, the datasets are generated by an LDS model (1) of different system orders, k, and observation sizes, d. For each test case, 100 data sequences of length 200 samples are generated and sliced into 70%/30% ratios for the training set and test set, respectively. The dynamics matrix A is selected to produce a stable system: {|λ_i(A)| = s : s ≤ 1, ∀i ∈ (1, k)}, where λ_i(A) is the ith eigenvalue of the matrix A. The noise components are drawn from Gaussian distributions and scaled so that p_η := E{ηᵀη}/m and p_ε := E{εᵀε}/n. Each test is repeated with the following settings: {S1: s = 0.970, p_η = 0.50 and p_ε = 0.1} and {S2: s = 0.999, p_η = 0.01 and p_ε = 0.1}.
Results: The NMSE and runtime results obtained on real and synthetic datasets are shown in Table 1 and Table 2, respectively. In terms of NMSE, LDS-DV matches or outperforms the alternative methods. In terms of speed, LDS-DV learns the model much faster than the competitors and scales well to larger-dimensional models.
The speed improvement is more significant for larger datasets and observations with higher dimensions.

Table 1: Real time series

              ATL (Long)              ATL (Short)             CAC (Long)              CAC (Short)
              NMSE          Time      NMSE          Time      NMSE          Time      NMSE          Time
LDS-MV        0.45±0.03     0.26      0.54±0.05     0.22      0.58±0.02     0.28      0.63±0.03     0.14
N4SID         0.52±0.04     2.34      0.59±0.05     0.95      0.61±0.02     1.23      0.84±0.07     1.08
EM            0.64±0.04     7.87      0.88±0.07     3.92      0.81±0.02     5.70      1.02±0.08     4.12
HSE-HMM       675.87±629.46 0.79      0.97±0.01     0.16      11.24±8.23    0.39      2.82±1.60     0.17
PEM-SSID      0.71±0.08     20.00     1.52±0.66     16.38     1.38±0.15     19.67     2.68±0.78     20.58

Table 2: Synthetic time series. [Table not reliably recoverable from the extraction: NMSE ± standard error and runtime in CPU seconds for LDS-MV, N4SID, EM, HSE-HMM and PEM-SSID under settings S1 and S2 with (d, k) ∈ {(5, 3), (8, 6), (16, 9)}.]

Results for real and synthetic datasets are listed in Table 1 and Table 2, respectively. The first column of each dataset is the average normalized MSE with standard error and the second column is the algorithm runtime in CPU seconds. The best NMSE according to a pairwise t-test with significance level of 5% is highlighted.

Figure 2: a) NMSE of the LDS-DV for increasing length of training sequence. The difference between LDS-DV and N4SID is more significant at shorter training lengths, while both converge to the same accuracy at large T. HSE-HMM is omitted due to its high error. b) Runtime in CPU seconds for increasing length of training sequence. LDS-DV scales well with large sample length. c) MSE of the LDS-DV versus MSE of N4SID. At higher values of MSE, the points are below the identity line and LDS-DV is more likely to win.

For test cases with |λ_i(A)| ≈ 1, designed to evaluate the prediction performance of the methods for marginally stable systems, LDS-DV can still learn a stable model while the other algorithms might not. The proposed LDS-DV method does not explicitly impose stability, but the regularization favors an A that is stable.
The regularizer on the latent states encourages smooth dynamics and controls overfitting: overfitting to noisy observations can lead to an unstable estimate of the model (Buesing et al., 2012a), and a smooth latent trajectory is a favorable property in most real-world applications.

Figure 2(c) shows the MSE of LDS-DV versus that of N4SID for all the CAC time series. The figure illustrates that on easier problems, LDS-DV and N4SID are comparable. However, as the difficulty, and hence the MSE, increases, LDS-DV begins to consistently outperform N4SID.

Figures 2(a) and 2(b) illustrate the accuracy and runtime, respectively, of the algorithms versus training length. We used the synthetic LDS model under condition S1 with d = 8, k = 6. Values are averaged over 20 runs with a test length of 50 samples. LDS-DV has better early performance, for smaller sample sizes; at larger sample sizes, the methods reach approximately the same error level.

6 Conclusion

In this paper, we provided an algorithm for optimal estimation of the parameters of a time-invariant, discrete-time linear dynamical system. More precisely, we provided a reformulation of the model as a two-view objective, which allowed recent advances in optimal estimation for two-view models to be applied. The resulting algorithm is simple to use and flexibly allows different losses and regularizers
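In the Gaussian case, one simple post-hoc strategy is to estimate the noise covariances from residuals once the states and system matrices are fixed. A minimal sketch under assumed shapes (X: T×k recovered latent states, Y: T×d observations; variable names are illustrative only):

```python
import numpy as np

def noise_covariances_from_residuals(X, Y, A, C):
    """Post-hoc estimates of the process and observation noise covariances.

    Model: x_{t+1} = A x_t + w_t,  y_t = C x_t + v_t, with
    X (T x k) the recovered latent states and Y (T x d) the observations.
    """
    W = X[1:] - X[:-1] @ A.T     # state-transition residuals w_t
    V = Y - X @ C.T              # observation residuals v_t
    Q = np.cov(W, rowvar=False)  # process-noise covariance (k x k)
    R = np.cov(V, rowvar=False)  # observation-noise covariance (d x d)
    return Q, R
```

This sketch only fits sample covariances of the residuals; it does not jointly optimize the noise parameters with the states and matrices.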
Though we do explicitly estimate the noise parameters, we do so only from the residuals, after obtaining the optimal hidden states and the transition and observation matrices. Moreover, consistency of the parameters learned by the proposed procedure is still an open problem and will be interesting future work.

The proposed optimization approach for LDSs should be useful for applications where alternative noise assumptions are desired. A Laplace assumption on the observations, for example, provides a more robust ℓ1 loss. A Poisson distribution has been advocated for count data, such as neural activity, where the time series is a vector of small integers (Buesing et al., 2012b). The proposed formulation of estimation for LDSs easily enables extension to such distributions. An important next step is to investigate the applicability to a wider range of time series data.

Acknowledgments

This work was supported in part by the Alberta Machine Intelligence Institute and NSERC. During this work, M. White was with the Department of Computer Science, Indiana University.

References

Andersson, S. (2009). Subspace estimation and prediction methods for hidden Markov models. The Annals of Statistics.

Åström, K. (1980). Maximum likelihood and prediction error methods. Automatica, 16(5):551–574.

Bach, F., Mairal, J., and Ponce, J. (2008). Convex sparse matrix factorizations. arXiv:0812.1869v1.

Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2.

Boots, B. and Gordon, G. (2012). Two-manifold problems with applications to nonlinear system identification. In International Conference on Machine Learning.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization.
Cambridge University Press.

Buesing, L., Macke, J., and Sahani, M. (2012a). Learning stable, regularised latent models of neural population dynamics. Network: Computation in Neural Systems.

Buesing, L., Macke, J., and Sahani, M. (2012b). Spectral learning of linear dynamics from generalised-linear observations with application to neural population data. In Advances in Neural Information Processing Systems.

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press.

Foster, D., Rodu, J., and Ungar, L. (2012). Spectral dimensionality reduction for HMMs. arXiv:1203.6130v1.

Gelfand, A., Diggle, P., Guttorp, P., and Fuentes, M. (2010). Handbook of Spatial Statistics. CRC Press.

Ghahramani, Z. and Hinton, G. (1996). Parameter estimation for linear dynamical systems. Technical report.

Haeffele, B., Young, E., and Vidal, R. (2014). Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning.

Hsu, D., Kakade, S., and Zhang, T. (2012). A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences.

Katayama, T. (2006). Subspace Methods for System Identification. Springer.

Ljung, L. (1999). System Identification (2nd Ed.): Theory for the User. Prentice Hall PTR.

Macke, J., Buesing, L., and Sahani, M. (2015). Estimating state and model parameters in state-space models of spike trains. Advanced State Space Methods for Neural and Clinical Data.

Moonen, M. and Ramos, J. (1993). A subspace algorithm for balanced state space system identification. IEEE Transactions on Automatic Control.

Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization. Now Publishers.

Roweis, S. and Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation.

Siddiqi, S., Boots, B., and Gordon, G. (2007).
A constraint generation approach to learning stable linear dynamical systems. In Advances in Neural Information Processing Systems.

Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. (2010). Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning.

Van Overschee, P. and De Moor, B. (1994). N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica.

Viberg, M. (1995). Subspace-based methods for the identification of linear time-invariant systems. Automatica.

White, M., Wen, J., Bowling, M., and Schuurmans, D. (2015). Optimal estimation of multivariate ARMA models. In AAAI Conference on Artificial Intelligence.

White, M., Yu, Y., Zhang, X., and Schuurmans, D. (2012). Convex multi-view subspace learning. In Advances in Neural Information Processing Systems.

Yu, H., Rao, N., and Dhillon, I. (2015). High-dimensional time series prediction with missing values. arXiv:1509.08333.

Yu, Y., Zhang, X., and Schuurmans, D. (2014). Generalized conditional gradient for sparse estimation. arXiv:1410.4828.

Zhang, X., Yu, Y., and Schuurmans, D. (2012). Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems.

Zhao, H. and Poupart, P. (2014). A sober look at spectral learning. arXiv:1406.4631.