{"title": "Multilinear Dynamical Systems for Tensor Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 2634, "page_last": 2642, "abstract": "Many scientific data occur as sequences of multidimensional arrays called tensors.  How can hidden, evolving trends in such data be extracted while preserving the tensor structure?  The model that is traditionally used is the linear dynamical system (LDS), which treats the observation at each time slice as a vector.  In this paper, we propose the multilinear dynamical system (MLDS) for modeling tensor time series and an expectation-maximization (EM) algorithm to estimate the parameters.  The MLDS models each time slice of the tensor time series as the multilinear projection of a corresponding member of a sequence of latent, low-dimensional tensors.  Compared to the LDS with an equal number of parameters, the MLDS achieves higher prediction accuracy and marginal likelihood for both simulated and real datasets.", "full_text": "Multilinear Dynamical Systems for Tensor Time Series\n\nMark Rogers, Lei Li, Stuart Russell\n\nEECS Department, University of California, Berkeley\n\nmarkrogersjr@berkeley.edu, {leili,russell}@cs.berkeley.edu\n\nAbstract\n\nData in the sciences frequently occur as sequences of multidimensional arrays\ncalled tensors. How can hidden, evolving trends in such data be extracted while\npreserving the tensor structure? The model that is traditionally used is the linear\ndynamical system (LDS) with Gaussian noise, which treats the latent state and\nobservation at each time slice as a vector. We present the multilinear dynamical\nsystem (MLDS) for modeling tensor time series and an expectation\u2013maximization\n(EM) algorithm to estimate the parameters. The MLDS models each tensor\nobservation in the time series as the multilinear projection of the corresponding member\nof a sequence of latent tensors. 
The latent tensors are again evolving with respect\nto a multilinear projection. Compared to the LDS with an equal number of parameters,\nthe MLDS achieves higher prediction accuracy and marginal likelihood for\nboth artificial and real datasets.\n\n1 Introduction\n\nA tenet of mathematical modeling is to faithfully match the structural properties of the data; yet, on\noccasion, the available tools are inadequate to perform the task. This scenario is especially common\nwhen the data are tensors, i.e., multidimensional arrays: vector and matrix models are fitted to them\nwithout justification. This is, perhaps, due to the lack of an agreed-upon tensor model. There are\nmany examples that seem to require such a model: The spatiotemporal grid of atmospheric data in\nclimate modeling is a time series of n \u00d7 m \u00d7 l tensors, where n, m and l are the numbers of latitude,\nlongitude, and elevation grid points. If k measurements (e.g., temperature, humidity, and wind\nspeed for k = 3) are made, then a time series of n \u00d7 m \u00d7 l \u00d7 k tensors is constructed. The daily high,\nlow, opening, closing, adjusted closing, and volume of the stock prices of n companies\ncomprise a time series of 6 \u00d7 n tensors. A grayscale video sequence is a two-dimensional tensor\ntime series because each frame is a two-dimensional array of pixels.\nSeveral queries can be made when one is presented with a tensor time series. As with any time\nseries, a forecast of future data may be requested. For climate data, successful prediction may\nspell out whether the overall ocean temperatures will increase. Prediction of stock prices may not\nonly inform investors but also help to stabilize the economy and prevent market collapse. The\nrelationships between particular subsets of tensor elements could be of significance. How does the\ntemperature of the ocean at 8\u00b0N, 165\u00b0E affect the temperature at 5\u00b0S, 125\u00b0W? 
For stock price data,\none may investigate how the stock prices of electric car companies affect those of oil companies.\nFor a video sequence, one might expect adjacent pixels to be more correlated than those far away\nfrom each other. Another way to describe the relationships among tensor elements is in terms of\ntheir covariances. Equipped with a tabulation of the covariances, one may read off how a given\ntensor element affects others. Later in this paper, we will define a tensor time series model and a\ncovariance tensor that permits the modeling of general noise relationships among tensor elements.\nMore formally, a tensor X \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM is a multidimensional array with elements that can each be\nindexed by a vector of positive integers. That is, every element Xi1\u00b7\u00b7\u00b7iM \u2208 R is uniquely addressed\nby a vector (i1, . . . , iM ) such that 1 \u2264 im \u2264 Im for all m. Each of the M dimensions of X is called\na mode and represents a particular component of the data. The simplest tensors are vectors and\nmatrices: vectors are tensors with only a single mode, while matrices are tensors with two modes.\nWe will consider the tensor time series, which is an ordered, finite collection of tensors that all share\nthe same dimensionality. In practice, each member of an observed tensor time series reflects the\nstate of a dynamical system that is measured at discrete epochs.\nWe propose a novel model for tensor time series: the multilinear dynamical system (MLDS). The\nMLDS explicitly incorporates the dynamics, noise, and tensor structure of the data by juxtaposing\nconcepts in probabilistic graphical models and multilinear algebra. Specifically, the MLDS generalizes\nthe states of the linear dynamical system (LDS) to tensors via a probabilistic variant of the\nTucker decomposition. 
The LDS tracks latent vector states and observed vector sequences; this\npermits forecasting, estimation of latent states, and modeling of noise but only for vector objects.\nMeanwhile, the Tucker decomposition of a single tensor computes a latent \u201ccore\u201d tensor but has\nno dynamics or noise capabilities. Thus, the MLDS achieves the best of both worlds by uniting\nthe two models in a common framework. We show that the MLDS, in fact, generalizes LDS and\nother well-known vector models to tensors of arbitrary dimensionality. In our experiments on both\nsynthetic and real data, we demonstrate that the MLDS outperforms the LDS with an equal number\nof parameters.\n\n2 Tensor algebra\nLet N be the set of all positive integers and R be the set of all real numbers. Given I \u2208 NM ,\nwhere M \u2208 N, we assemble a tensor-product space RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM , which will sometimes be written\nas RI = R(I1,...,IM ) for shorthand. Then a tensor X \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM is an element of a tensor-product\nspace. A tensor X may be referenced by either a full vector (i1, . . . , iM ) or by a subvector, using\nthe \u2022 symbol to indicate coordinates that are not fixed. For example, let X \u2208 RI1\u00d7I2\u00d7I3. Then\nXi1i2i3 is a scalar, X\u2022i2i3 \u2208 RI1 is the vector obtained by setting the second and third coordinates\nto i2 and i3, and X\u2022\u2022i3 \u2208 RI1\u00d7I2 is the matrix obtained by setting the third coordinate to i3. The\nconcatenation of two M-dimensional vectors I = (I1, . . . , IM ) and J = (J1, . . . , JM ) is given by\nIJ = (I1, . . . , IM , J1, . . . , JM ), a vector with 2M entries.\nLet X \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM , M \u2208 N. The vectorization vec(X) \u2208 RI1\u00b7\u00b7\u00b7IM is obtained by shaping the\ntensor into a vector. In particular, the elements of vec(X) are given by vec(X)k = Xi1\u00b7\u00b7\u00b7iM , where\nk = 1 + \u2211_{m=1}^{M} \u220f_{n=1}^{m\u22121} I_n (i_m \u2212 1). 
For example, if X \u2208 R2\u00d73\u00d72 is given by (with semicolons separating matrix rows)\n\nX\u2022\u20221 = [1 3 5; 2 4 6], X\u2022\u20222 = [7 9 11; 8 10 12],\n\nthen vec(X) = (1 2 3 4 5 6 7 8 9 10 11 12)T .\nLet I, J \u2208 NM , M \u2208 N. The matricization mat(A) \u2208 RI1\u00b7\u00b7\u00b7IM\u00d7J1\u00b7\u00b7\u00b7JM of a tensor A \u2208 RIJ\nis given by mat(A)kl = Ai1\u00b7\u00b7\u00b7iM j1\u00b7\u00b7\u00b7jM , where k = 1 + \u2211_{m=1}^{M} \u220f_{n=1}^{m\u22121} I_n (i_m \u2212 1) and l = 1 +\n\u2211_{m=1}^{M} \u220f_{n=1}^{m\u22121} J_n (j_m \u2212 1). The matricization \u201cflattens\u201d a tensor into a matrix. For example, define\nA \u2208 R2\u00d72\u00d72\u00d72 by\n\nA\u2022\u202211 = [1 3; 2 4], A\u2022\u202221 = [5 7; 6 8], A\u2022\u202212 = [9 11; 10 12], A\u2022\u202222 = [13 15; 14 16].\n\nThen we have\n\nmat(A) = [1 5 9 13; 2 6 10 14; 3 7 11 15; 4 8 12 16].\n\nThe vec and mat operators put tensors in bijective correspondence with vectors and matrices. To\ndefine the inverse of each of these operators, a reference must be made to the dimensionality of the\noriginal tensor. In other words, given X \u2208 RI and A \u2208 RIJ, where I, J \u2208 NM , M \u2208 N, we have\nX = vec^{\u22121}_I (vec(X)) and A = mat^{\u22121}_{IJ} (mat(A)).\nLet I, J \u2208 NM , M \u2208 N. The factorization of a tensor A \u2208 RIJ is given by A_{i1\u00b7\u00b7\u00b7iM j1\u00b7\u00b7\u00b7jM} =\n\u220f_{m=1}^{M} A^{(m)}_{i_m j_m}, where A(m) \u2208 RIm\u00d7Jm for all m. The factorization exponentially reduces the\nnumber of parameters needed to express A from \u220f_{m=1}^{M} I_m J_m to \u2211_{m=1}^{M} I_m J_m. 
In matrix form, we\nhave mat(A) = A(M ) \u2297 A(M\u22121) \u2297 \u00b7\u00b7\u00b7 \u2297 A(1), where \u2297 is the Kronecker matrix product [1]. Note\nthat tensors in RIJ are not factorizable in general [2].\nThe product A \u229b X of two tensors A \u2208 RIJ and X \u2208 RJ, where I, J \u2208 NM , M \u2208 N, is given\nby (A \u229b X)_{i1\u00b7\u00b7\u00b7iM} = \u2211_{j1\u00b7\u00b7\u00b7jM} A_{i1\u00b7\u00b7\u00b7iM j1\u00b7\u00b7\u00b7jM} X_{j1\u00b7\u00b7\u00b7jM}. The tensor A is called a multilinear operator\nwhen it appears in a tensor product as above. The product is only defined if the dimensionalities of\nthe last M modes of A match the dimensionalities of X. Note that this tensor product generalizes\nthe standard matrix-vector product in the case M = 1.\nWe shall primarily work with tensors in their vector and matrix representations. Hence, we appeal\nto the following\nLemma 1. Let I, J \u2208 NM , M \u2208 N, A \u2208 RIJ , X \u2208 RJ. Then\n\nvec(A \u229b X) = mat(A) vec(X) . (1)\n\nFurthermore, if A is factorizable with matrices A(m), then\n\nvec(A \u229b X) = [A(M ) \u2297 \u00b7\u00b7\u00b7 \u2297 A(1)] vec(X) . (2)\n\nProof. Let k = 1 + \u2211_{m=1}^{M} \u220f_{n=1}^{m\u22121} I_n (i_m \u2212 1) and l = 1 + \u2211_{m=1}^{M} \u220f_{n=1}^{m\u22121} J_n (j_m \u2212 1) for some\n(j1, . . . , jM ). We have\n\nvec(A \u229b X)_k = \u2211_{j1\u00b7\u00b7\u00b7jM} A_{i1\u00b7\u00b7\u00b7iM j1\u00b7\u00b7\u00b7jM} X_{j1\u00b7\u00b7\u00b7jM} = \u2211_l mat(A)_{kl} vec(X)_l = (mat(A) vec(X))_k ,\n\nwhich holds for all 1 \u2264 im \u2264 Im, 1 \u2264 m \u2264 M. Thus, (1) holds. To prove (2), we express mat(A)\nas the Kronecker product of M matrices A(1), . . . 
, A(M ).\nThe Tucker decomposition can be expressed using the product \u229b defined above. The\nTucker decomposition models a given tensor X \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM as the result of a multilinear\ntransformation that is applied to a latent core tensor Z \u2208 RJ1\u00d7\u00b7\u00b7\u00b7\u00d7JM : X = A \u229b Z.\nThe multilinear operator A is a factorizable tensor such that\nmat(A) = A(M ) \u2297 A(M\u22121) \u2297 \u00b7\u00b7\u00b7 \u2297 A(1), where A(1), . . . , A(M )\nare projection matrices (Figure 1). The canonical decomposition/parallel\nfactors (CP) decomposition is a special case of the\nTucker decomposition in which Z is \u201csuperdiagonal\u201d, i.e., J1 =\n\u00b7\u00b7\u00b7 = JM = R and only the Zj1\u00b7\u00b7\u00b7jM such that j1 = \u00b7\u00b7\u00b7 = jM\ncan be nonzero. The CP decomposition expresses X as a sum\nX = \u2211_{r=1}^{R} u^{(1)}_r \u25e6 \u00b7\u00b7\u00b7 \u25e6 u^{(M)}_r, where u^{(m)}_r \u2208 RIm for all m and r and \u25e6 denotes the tensor outer\nproduct [3].\n\nFigure 1: The Tucker decomposition of a third-order tensor X.\n\nTo illustrate, consider the case M = 2 and let X = A \u229b Z, where X \u2208 Rn\u00d7m and Z \u2208 Rp\u00d7q. Then\nX = AZBT, where mat(A) = B \u2297 A. If p \u2264 n and q \u2264 m, then Z is a dimensionality-reduced\nversion of X: the matrix A increases the number of rows of Z from p to n via left-multiplication,\nwhile the matrix B increases the number of columns of Z from q to m via right-multiplication. To\nreconstruct X, we simply apply A \u229b Z. See Figure 1 for an illustration of the case M = 3.\n\n3 Random tensors\nGiven I \u2208 NM , M \u2208 N, we define a random tensor X \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM as follows. Suppose vec(X)\nis normally distributed with expectation vec(U) and positive-definite covariance mat(S), where U \u2208\nRI and S \u2208 RII. 
Then we say that X has the normal distribution with expectation U \u2208 RI and\ncovariance S \u2208 RII and write X \u223c N (U, S). The definition of the normal distribution on tensors\ncan thus be restated more succinctly as\n\nX \u223c N (U, S) \u21d0\u21d2 vec(X) \u223c N (vec U, mat S) . (3)\n\nOur formulation extends the normal distribution defined in [4], which is restricted to symmetric,\nsecond-order tensors.\nWe will make use of an important special case of the normal distribution defined on tensors: the\nmultilinear Gaussian distribution. Let I, J \u2208 NM , M \u2208 N, and suppose X \u2208 RI and Z \u2208 RJ are\njointly distributed as\n\nZ \u223c N (U, G) and X | Z \u223c N (C \u229b Z, S) , (4)\n\nwhere C \u2208 RIJ. The marginal distribution of X and the posterior distribution of Z given X are given\nby the following result.\nLemma 2. Let I, J \u2208 NM , M \u2208 N, and suppose the joint distribution of random tensors X \u2208 RI\nand Z \u2208 RJ is given by (4). Then the marginal distribution of X is\n\nX \u223c N (C \u229b U, C \u229b G \u229b CT + S) , (5)\n\nwhere CT \u2208 RJI and (CT)_{j1\u00b7\u00b7\u00b7jM i1\u00b7\u00b7\u00b7iM} = C_{i1\u00b7\u00b7\u00b7iM j1\u00b7\u00b7\u00b7jM}. The conditional distribution of Z given X is\n\nZ | X \u223c N (\u02c6U, \u02c6G) , (6)\n\nwhere \u02c6U = vec^{\u22121}_J (\u00b5 + W (vec(X) \u2212 mat(C) \u00b5)), \u02c6G = mat^{\u22121}_{JJ} (\u0393 \u2212 W mat(C) \u0393), \u00b5 = vec(U),\n\u0393 = mat(G), \u03a3 = mat(S), and W = \u0393 mat(C)^T [mat(C) \u0393 mat(C)^T + \u03a3]^{\u22121}.\n\nProof. Lemma 1, (3), and (4) imply that the vectorizations of Z and X given Z follow vec(Z) \u223c\nN (\u00b5, \u0393) and vec(X) | vec(Z) \u223c N (mat(C) vec(Z) , \u03a3). 
By the properties of the multivariate\nnormal distribution, the marginal distribution of vec(X) and the conditional distribution of vec(Z)\ngiven vec(X) are vec(X) \u223c N (mat(C) vec(U), mat(C) \u0393mat(C)T + \u03a3) and vec(Z) | vec(X) \u223c\nN (vec( \u02c6U), mat(\u02c6G)). The associativity of \u229b implies that mat(C \u229b G \u229b CT) = mat(C) \u0393mat(C)T.\nFinally, we apply Lemma 1 once more to obtain (5) and (6).\n\n4 Multilinear dynamical system\n\nThe aim is to develop a model of a tensor time series X1, . . . , XN that takes into account tensor\nstructure. In defining the MLDS, we build upon the results of previous sections by treating each\nXn as a random tensor and relating the model components with multilinear transformations. When\nthe MLDS components are vectorized and matricized, an LDS with factorized transition and projection\nmatrices is revealed. Hence, the strategy for fitting the MLDS is to vectorize each Xn, run the\nexpectation-maximization (EM) algorithm of the LDS for all components but the matricized transition\nand projection tensors (which are learned via an alternative gradient method), and finally convert\nall model components back to tensor form.\n\n4.1 Definition\nLet I, J \u2208 NM , M \u2208 N. The MLDS model consists of a sequence Z1, . . . , ZN of latent tensors,\nwhere Zn \u2208 RJ1\u00d7\u00b7\u00b7\u00b7\u00d7JM for all n. Each latent tensor Zn emits an observation Xn \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM .\nThe system is initialized by a latent tensor Z1 distributed as\n\nZ1 \u223c N (U0, Q0) . (7)\n\nGiven Zn, 1 \u2264 n \u2264 N \u2212 1, we generate Zn+1 according to the conditional distribution\n\nZn+1 | Zn \u223c N (A \u229b Zn, Q) , (8)\n\nwhere Q is the conditional covariance shared by all Zn, 2 \u2264 n \u2264 N, and A is the transition tensor\nwhich describes the dynamics of the evolving sequence Z1, . . . , ZN . 
The transition tensor A is\nfactorized into M matrices A(m), each of which acts on a mode of Zn. In matrix form, we have\nmat(A) = A(M ) \u2297 \u00b7\u00b7\u00b7 \u2297 A(1). To each Zn there corresponds an observation Xn generated by\n\nXn | Zn \u223c N (C \u229b Zn, R) , (9)\n\nwhere R is the covariance shared by all Xn and C is the projection\ntensor which multilinearly transforms the latent tensor Zn.\nLike the transition tensor A, the projection tensor C is factorizable,\ni.e., mat(C) = C (M ) \u2297 \u00b7\u00b7\u00b7 \u2297 C (1). See Figure 2 for an\nillustration of the MLDS.\n\nFigure 2: Schematic of the MLDS with three modes.\n\nBy vectorizing each Xn and Zn, the MLDS becomes an LDS\nwith factorized transition and projection matrices mat(A) and\nmat(C). For the LDS, the transition and projection operators are\nnot factorizable in general [2]. The factorizations of A and C\nfor the MLDS not only allow for a generalized dimensionality\nreduction of tensors but exponentially reduce the number of parameters of the transition and projection\noperators from |A_LDS| + |C_LDS| = \u220f_{m=1}^{M} J_m^2 + \u220f_{m=1}^{M} I_m J_m down to |A_MLDS| + |C_MLDS| =\n\u2211_{m=1}^{M} J_m^2 + \u2211_{m=1}^{M} I_m J_m.\n\n4.2 Parameter estimation\n\nGiven a sequence of observations X1, . . . , XN , we wish to fit the MLDS model by estimating\n\u03b8 = (U0, Q0, Q, A, R, C). Because the MLDS model contains latent variables Zn, we cannot directly\nmaximize the likelihood of the data with respect to \u03b8. The EM algorithm circumvents this\ndifficulty by iteratively updating (E(Z1), . . . , E(ZN )) and \u03b8 in an alternating manner until the expected,\ncomplete likelihood of the data converges [5]. The normal distribution of tensors (3) will\nfacilitate matrix and vector computations rather than compel us to work directly with tensors. 
In\nparticular, we can express the complete likelihood of the MLDS model as\n\nL (\u03b8 | Z1, X1, . . . , ZN , XN ) = L (vec \u03b8 | vec Z1, vec X1, . . . , vec ZN , vec XN ) , (10)\n\nwhere vec \u03b8 = (vec U0, mat Q0, mat Q, mat A, mat R, mat C). It follows that the vectorized MLDS\nis an LDS that inherits the Kalman filter updates for the E-step and the M-step for all parameters\nexcept mat A and mat C. See [6] for the EM algorithm of the LDS.\nBecause A and C are factorizable, an alternative to the standard LDS updates is required. We\nlocally maximize the expected, complete log-likelihood by computing the gradient with respect to\nthe vector v = [vec C (1)T \u00b7\u00b7\u00b7 vec C (M )T]T \u2208 R^{\u2211_m I_m J_m}, which is obtained by concatenating the\nvectorizations of the projection matrices C (m). The expected, complete log-likelihood (with terms\nconstant with respect to C deleted) can be written as\n\nl(v) = tr{ \u2126 mat(C) [\u03a8 mat(C)^T \u2212 2\u03a6^T] } , (11)\n\nwhere \u2126 = mat(R)^{\u22121}, \u03a8 = \u2211_{n=1}^{N} E(vec Zn vec Zn^T), and \u03a6 = \u2211_{n=1}^{N} vec(Xn) E(vec Zn)^T. Now\nlet k correspond to some C^{(m)}_{ij} and let \u2206ij \u2208 RIm\u00d7Jm be the indicator matrix that is one at the\n(i, j)th entry and zero elsewhere. The gradient \u2207l(v) \u2208 R^{\u2211_m I_m J_m} is given elementwise by\n\n\u2207l(v)_k = 2 tr{ \u2126 \u2202_{v_k} mat(C) [\u03a8 mat(C)^T \u2212 \u03a6^T] } , (12)\n\nwhere \u2202_{v_k} mat(C) = C (M ) \u2297 \u00b7\u00b7\u00b7 \u2297 \u2206ij \u2297 \u00b7\u00b7\u00b7 \u2297 C (1) [1]. 
If m = M, then we can exploit the sparsity\nof \u2202_{v_k} mat(C) by computing the trace of the product of two submatrices each with \u220f_{n\u2260M} I_n rows\nand \u220f_{n\u2260M} J_n columns:\n\n\u2207l(v)_k = 2 tr{ [C (M\u22121) \u2297 \u00b7\u00b7\u00b7 \u2297 C (1)]^T \u039bij } , (13)\n\nwhere \u039bij is the submatrix of \u2126 [mat(C) \u03a8 \u2212 \u03a6] with row indices (1, . . . , \u220f_{n\u2260M} I_n) shifted by\n\u220f_{n\u2260M} I_n (i \u2212 1) and column indices (1, . . . , \u220f_{n\u2260M} J_n) shifted by \u220f_{n\u2260M} J_n (j \u2212 1). If m \u2260 M,\nthen the ordering of the modes can be replaced by 1, . . . , m \u2212 1, m + 1, . . . , M, m and the rows and\ncolumns of \u2126 [mat(C) \u03a8 \u2212 \u03a6] can be permuted accordingly. In other words, the original tensors Xn\nare \u201crotated\u201d so that the mth mode becomes the M th mode.\nThe M-step for A can be computed in a manner analogous to that of C by replacing I by J, replacing\nmat(C) by mat(A), and substituting v = [vec(A(1))T \u00b7\u00b7\u00b7 vec(A(M ))T]T, \u2126 = mat(Q)^{\u22121}, \u03a8 =\n\u2211_{n=1}^{N\u22121} E[vec(Zn) vec(Zn)^T], and \u03a6 = \u2211_{n=1}^{N\u22121} E[vec(Zn+1) vec(Zn)^T] into (11).\n\n4.3 Special cases of the MLDS and their relationships to existing models\n\nIt is clear that the MLDS is exactly an LDS in the case M = 1. Certain constraints on the MLDS\nalso lead to generalizations of factor analysis, probabilistic principal components analysis (PPCA),\nthe CP decomposition, and the matrix factorization model of collaborative filtering (MF). Let p =\n\u220f_{m=1}^{M} I_m and q = \u220f_{m=1}^{M} J_m. If A = 0, U0 = 0, and Q0 = Q, then the Xn of the MLDS become\nindependent and identically distributed draws from the multilinear Gaussian distribution. Setting\nmat(Q) = Idq and mat(R) to a diagonal matrix results in a model that reduces to factor analysis\nin the case M = 1. A further constraint on R, mat(R) = \u03c1^2 Idp, yields a multilinear extension of\nPPCA. 
Removing the constraints on R and forcing mat(Zn) = Idq for all n results in a probabilistic\nCP decomposition in which the tensor elements have general covariances. Finally, the constraint\nM = 2 yields a probabilistic MF.\n\n5 Experimental results\n\nTo determine how well the MLDS could model tensor time series, the fits of the MLDS were compared\nto those of the LDS for both synthetic and real data. To avoid unnecessary complexity and\nhighlight the difference between the two models (namely, how the transition and projection operators\nare defined), the noises in the models are isotropic. The MLDS parameters are initialized so\nthat U0 is drawn from the standard normal distribution, the matricizations of the covariance tensors\nare identity matrices, and the columns of each A(m) and C (m) are the first Jm eigenvectors of\nsingular-value-decomposed matrices with entries drawn from the standard normal distribution. The\nLDS parameters are initialized in the same way by setting M = 1.\nThe prediction error and convergence in likelihood were measured for each dataset. For the\nsynthetic dataset, model complexity was also measured. The prediction error \u03b5^M_n of a given\nmodel M for the nth member of a tensor time series X1, . . . , XN is the relative Euclidean distance\n||Xn \u2212 X^M_n|| / ||Xn||, where ||\u00b7|| = ||vec(\u00b7)||_2. Each estimate X^M_n is given by X^M_n =\nvec^{\u22121}_I (mat(C^M) mat(A^M)^n vec(E[Z^M_{Ntrain}])), where E[Z^M_{Ntrain}] is the estimate of the latent state\nof the last member of the training sequence. The convergence in likelihood of each model is determined\nby monitoring the marginal likelihood as the number of EM iterations increases. Each model\nis allowed to run until the difference between consecutive log-likelihood values is less than 0.1%\nof the latter value. Lastly, the model complexity is determined by observing how the likelihood\nand prediction error of each model vary as the model size |\u03b8^M| increases. 
Aside from the model\ncomplexity experiment, the LDS latent dimensionality is always set to the smallest value such that\nthe number of parameters of the LDS is greater than or equal to that of the MLDS.\n\n5.1 Results for synthetic data\n\nThe synthetic dataset is an MLDS with dimensions I = (7, 11), J = (3, 5), and N = 1100 and\nparameters initialized as described in the first paragraph of this section. For the prediction error and\nconvergence analyses, the latent dimensionality of the MLDS for fitting was set to J = (3, 5) as\nwell. Each model was trained on the first 1000 elements and tested on the last 100 elements of the\nsequence. The results are shown in Figure 3. According to Figure 3(a), the prediction error of MLDS\nmatches that of the true model and is below that of the LDS. Furthermore, the MLDS converges to\nthe likelihood of the true model, which is greater than that of the LDS (see Figure 3(b)). 
As for\nmodel complexity, the model size needed for the MLDS to match the likelihood and prediction error\nof the true model is much smaller than that of the LDS (see Figure 3(c) and (d)).\n\nFigure 3: Results for synthetic data. Prediction error \u03b5^M_n = ||Xn \u2212 X^M_n|| / ||Xn|| is shown as a function of\nthe time slice n in (a), convergence of marginal log-likelihood is shown in (b), marginal log-likelihood as a\nfunction of model size is shown in (c), and cumulative prediction error \u2211_{n=Ntrain+1}^{Ntrain+Ntest} \u03b5^M_n as a function of model\nsize is shown in (d) for LDS, MLDS, and the true model.\n\n5.2 Results for real data\n\nWe consider the following datasets:\nSST: A 5-by-6 grid of sea-surface temperatures from 5\u00b0N, 180\u00b0W to 5\u00b0S, 110\u00b0W recorded hourly\nfrom 7:00PM on 4/26/94 to 3:00AM on 7/19/94, yielding 2000 epochs [7].\nTesla: Opening, closing, high, low, and volume of the stock prices of 12 car and oil companies\n(e.g., Tesla Motors Inc.), from 6/29/10 to 5/10/13 (724 epochs).\nNASDAQ-100: Opening, closing, adjusted-closing, high, low, and volume for 20 randomly-chosen\nNASDAQ-100 companies, from 1/1/05 to 12/31/09 (1259 epochs).\nVideo: 1171 grayscale frames of ocean surf during low tide. This dataset was chosen because it\nrecords a quasiperiodic natural scene.\n\nFigure 4: Results for LDS and MLDS applied to real data. The first row, panels (a)\u2013(d), corresponds to prediction error \u03b5^M_n\nas a function of the time slice n, while the second, panels (e)\u2013(h), corresponds to convergence in log-likelihood. Sea-surface\ntemperature, Tesla, NASDAQ-100, and Video results are given by the respective columns.\n\nFor each dataset, MLDS achieved higher prediction accuracy and likelihood than LDS. 
For the SST\ndataset, each model was trained on the first 1800 epochs; occlusions were filled in using linear\ninterpolation and refined with an extra step during the learning that replaced the estimates of the\noccluded values by the conditional expectation given all the training data. For results when the\nMLDS dimensionality is set to (3, 3), see Figure 4(a) and (e). For the Tesla dataset, each time series\n((X1)ij, . . . , (XN )ij) was normalized prior to learning by subtracting the mean and dividing by\nthe standard deviation. Each model was trained on the first 700 epochs. See Figure 4(b) and (f) for\nresults when the MLDS dimensionality is set to (5, 2). For the NASDAQ-100 dataset, each model\nwas trained on the first 1200 epochs. The data were normalized in the same way as with the Tesla\ndataset. For results when the MLDS dimensionality is set to (10, 3), see Figure 4(c) and (g). For the\nVideo dataset, a 100-by-100 patch was selected, spatially downsampled to a 10-by-10 patch for each\nframe, and normalized as before. Each model was trained on the first 1000 frames. See Figure 4(d)\nand (h) for results when the MLDS dimensionality is set to (5, 5).\n\n6 Related work\n\nSeveral existing models can be fitted to tensor time series. If each tensor is \u201cvectorized\u201d, i.e., reexpressed\nas a vector so that each element is indexed by a single positive integer, then an LDS can be\napplied [6, 8]. An obvious limitation of the LDS for modeling tensor time series is that the tensor\nstructure is not preserved. Thus, it is less clear how the latent vector space of the LDS relates to the\nvarious tensor modes. Further, one cannot postulate a latent dimension for each mode as with the\nMLDS. 
The net result, as we have shown, is that the LDS requires more parameters than the MLDS\nto model a given system (assuming it does have tensor structure).\nDynamic tensor analysis (DTA) and Bayesian probabilistic tensor factorization (BPTF) are explicit\nmodels of tensor time series [9, 10]. For DTA, a latent, low-dimensional \u201ccore\u201d tensor and a set of\nprojection matrices are learned by processing each member Xn \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM of the sequence as\nfollows. For each mode m, the tensor is flattened into a matrix X^{(m)}_n \u2208 R^{(\u220f_{k\u2260m} I_k)\u00d7I_m} and then\nmultiplied by its transpose. The result X^{(m)T}_n X^{(m)}_n is added to a matrix S(m) that has accumulated\nthe flattenings of the previous n \u2212 1 tensors. The eigenvalue decomposition U \u039bU T of the updated\nS(m) is then computed and the mth projection matrix is given by the first rank(S(m)) columns of\nU. After this procedure is carried out for each mode, the core tensor is updated via the multilinear\ntransformation given by the Tucker decomposition. Like the LDS, DTA is a sequential model. 
An\nadvantage of DTA over the LDS is that the tensor structure of the data is preserved. A disadvantage\nis that there is no straightforward way to predict future terms of the tensor time series. Another\ndisadvantage is that there is no mechanism that allows for arbitrary noise relationships among the\ntensor elements. In other words, the noise in the system is assumed to be isotropic.\nOther families of isotropic models have been devised that \u201ctensorize\u201d the time dimension by concatenating\nthe tensors in the time series to yield a single new tensor with an additional temporal\nmode. These models include multilinear principal components analysis [11], the memory-efficient\nTucker algorithm [12], and Bayesian tensor analysis [13]. For fitting to data, such models rely on\nalternating optimization methods, such as alternating least squares, which are applied to each mode.\nBPTF allows for prediction and more general noise modeling than DTA. BPTF is a multilinear\nextension of collaborative filtering models [14, 15, 16] that concatenates the members\nof the tensor time series (Xn), Xn \u2208 RI1\u00d7\u00b7\u00b7\u00b7\u00d7IM , to yield a higher-order tensor R \u2208\nRI1\u00d7\u00b7\u00b7\u00b7\u00d7IM\u00d7K, where K is the sequence length. Each element of R is independently distributed\nas R_{i1\u00b7\u00b7\u00b7iM k} \u223c N (\u27e8u^{(1)}_{i1}, . . . , u^{(M)}_{iM}, T_k\u27e9, \u03b1^{\u22121}), where \u27e8\u00b7, . . . ,\u00b7\u27e9 denotes the tensor inner product\nand \u03b1 is a global precision parameter. Bayesian methods are then used to compute the canonical-decomposition/parallel-factors\n(CP) decomposition of R: R = \u2211_{r=1}^{R} u^{(1)}_r \u25e6 \u00b7\u00b7\u00b7 \u25e6 u^{(M)}_r \u25e6 T_r, where \u25e6 is\nthe tensor outer product. Each u^{(m)}_r is independently drawn from a normal distribution with expectation\n\u00b5m and precision matrix \u039bm, while each Tr is recursively drawn from a normal distribution\nwith expectation Tr\u22121 and precision matrix \u039bT . 
The parameters, in turn, have conjugate prior distributions\nwhose posterior distributions are sampled via Markov-chain Monte Carlo for model fitting.\nThough BPTF supports prediction and general noise models, the latent tensor structure is limited.\nOther models with anisotropic noise include probabilistic tensor factorization (PTF) [17], tensor\nprobabilistic independent component analysis (TPICA) [18], and generalized coupled tensor factorization\n(GCTF) [19]. As with BPTF, PTF and TPICA utilize the CP decomposition of tensors. PTF\nis fit to tensor data by minimizing a heuristic loss function that is expressed as a sum of tensor inner\nproducts. TPICA iteratively flattens the tensor of data, executes a matrix model called probabilistic\nICA (PICA) as a subroutine, and decouples the factor matrices of the CP decomposition that are embedded\nin the \u201cmixing matrix\u201d of PICA. GCTF relates a collection of tensors by a hidden layer of\ndisconnected tensors via tensor inner products, drawing analogies to probabilistic graphical models.\n\n7 Conclusion\n\nWe have proposed a novel probabilistic model of tensor time series called the multilinear dynamical\nsystem (MLDS), based on a tensor normal distribution. By putting tensors and multilinear operators\nin bijective correspondence with vectors and matrices in a way that preserves tensor structure, the\nMLDS is formulated so that it becomes an LDS when its components are vectorized and matricized.\nIn matrix form, the transition and projection tensors can each be written as the Kronecker product of\nM smaller matrices and thus yield an exponential reduction in model complexity compared to the\nunfactorized transition and projection matrices of the LDS. 
As noted in Section 4.3, the MLDS generalizes the LDS, factor analysis, PPCA, the CP decomposition, and low-rank matrix factorization.\nThe results of multiple experiments that assess prediction accuracy, convergence in likelihood, and model complexity suggest that the MLDS achieves a better fit than the LDS on both synthetic and real datasets, given that the LDS has the same number of parameters as the MLDS.\n\nReferences\n[1] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, revised edition, 1999.\n[2] Vin De Silva and Lek-Heng Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30(3):1084\u20131127, 2008.\n[3] Tamara G. Kolda. Tensor decompositions and applications. SIAM Review, 51(3):455\u2013500, 2009.\n[4] Peter J. Basser and Sinisa Pajevic. A normal distribution for tensor-valued random variables: applications to diffusion tensor MRI. IEEE Transactions on Medical Imaging, 22(7):785\u2013794, 2003.\n[5] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1\u201338, 1977.\n[6] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 1st edition, 2006.\n[7] NOAA/Pacific Marine Environmental Laboratory. Tropical Atmosphere Ocean Project. http://www.pmel.noaa.gov/tao/data_deliv/deliv.html. Accessed: May 23, 2013.\n[8] Zoubin Ghahramani and Geoffrey E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto Department of Computer Science, 1996.\n[9] Jimeng Sun, Dacheng Tao, and Christos Faloutsos. 
Beyond streams and graphs: dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374\u2013383. ACM, 2006.\n[10] Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, and Jaime G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of SIAM Data Mining, 2010.\n[11] Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos. MPCA: Multilinear principal components analysis of tensor objects. IEEE Transactions on Neural Networks, 19(1), 2008.\n[12] Tamara Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining. In Eighth IEEE International Conference on Data Mining. IEEE, 2008.\n[13] Dacheng Tao, Mingli Song, Xuelong Li, Jialie Shen, Jimeng Sun, Xindong Wu, Christos Faloutsos, and Stephen J. Maybank. Bayesian tensor approach for 3-D face modeling. IEEE Transactions on Circuits and Systems for Video Technology, 18(10):1397\u20131410, 2008.\n[14] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30\u201337, 2009.\n[15] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, pages 1257\u20131264, 2008.\n[16] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.\n[17] Cyril Goutte and Massih-Reza Amini. Probabilistic tensor factorization and model selection. In Tensors, Kernels, and Machine Learning (TKML 2010), pages 1\u20134, 2010.\n[18] Christian F. Beckmann and Stephen M. Smith. Tensorial extensions of independent component analysis for multisubject FMRI analysis. Neuroimage, 25(1):294\u2013311, 2005.\n[19] Y. Kenan Yilmaz, A. 
Taylan Cemgil, and Umut Simsekli. Generalized coupled tensor factorization. In Neural Information Processing Systems. MIT Press, 2011.", "award": [], "sourceid": 1235, "authors": [{"given_name": "Mark", "family_name": "Rogers", "institution": "UC Berkeley"}, {"given_name": "Lei", "family_name": "Li", "institution": "UC Berkeley"}, {"given_name": "Stuart", "family_name": "Russell", "institution": "UC Berkeley"}]}