{"title": "Learning Auto-regressive Models from Sequence and Non-sequence Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1548, "page_last": 1556, "abstract": "Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the discrete-time Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods.", "full_text": "Learning Auto-regressive Models from Sequence and\n\nNon-sequence Data\n\nTzu-Kuo Huang\n\nMachine Learning Department\n\nCarnegie Mellon University\ntzukuoh@cs.cmu.edu\n\nJeff Schneider\nRobotics Institute\n\nCarnegie Mellon University\nschneide@cs.cmu.edu\n\nAbstract\n\nVector Auto-regressive models (VAR) are useful tools for analyzing time series\ndata. In quite a few modern time series modelling tasks, the collection of reliable\ntime series turns out to be a major challenge, either due to the slow progression of\nthe dynamic process of interest, or inaccessibility of repetitive measurements of\nthe same dynamic process over time. 
In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods.

1 Introduction

Vector Auto-regressive models (VAR) are an important class of models for analyzing multivariate time series data. They have proven to be very useful in capturing and forecasting the dynamic properties of time series in a number of domains, such as finance and economics [18, 13]. Recently, researchers in computational biology applied VAR models in the analysis of genomic time series [12], and found interesting results that were unknown previously.

In quite a few scientific modeling tasks, a major difficulty turns out to be the collection of reliable time series data. In some situations, the dynamic process of interest may evolve slowly over time, such as the progression of Alzheimer's or Parkinson's diseases, and researchers may need to spend months or even years tracking the dynamic process to obtain enough time series data for analysis. In other situations, the dynamic process of interest may not be able to undergo repetitive measurements, so researchers have to measure multiple instances of the same process while maintaining synchronization among these instances. One such example is gene expression time series. In their study, [19] measured expression profiles of yeast genes along consecutive metabolic cycles. 
Due to the destructive nature of the measurement technique, they collected expression data from multiple yeast cells. In order to obtain reliable time series data, they spent a lot of effort developing a stable environment to synchronize the cells during the metabolic cycles. Yet, they point out in their discussion that such a synchronization scheme may not work for other species, e.g., certain bacteria and fungi, as effectively as for yeast.

While obtaining reliable time series can be difficult, we observe that it is often easier to collect non-sequence samples, or snapshots of the dynamic process of interest¹. For example, a scientist studying Alzheimer's or Parkinson's can collect samples from his or her current pool of patients, each of whom may be in a different stage of the disease. Or in gene expression analysis, current technology already enables large-scale collection of static gene expression data. Previously [6] investigated ways to extract dynamics from such static gene expression data, and more recently [8, 9] proposed methods for learning first-order dynamic models from general non-sequence data. However, most of these efforts suffer from a fundamental limitation: due to lack of temporal information, multiple dynamic models may fit the data equally well and hence certain characteristics of dynamics, such as the step size of a discrete-time model and the overall temporal direction, become non-identifiable.

¹ In several disciplines, such as social and medical sciences, the former is usually referred to as a longitudinal study, while the latter is similar to what is called a cross-sectional study.

In this work, we aim to combine these two types of data to improve learning of dynamic models. We assume that a small amount of sequence samples and a large amount of non-sequence samples are available. 
Our aim is to rely on the few sequence samples to obtain a rough estimate of the model, while refining this rough estimate using the non-sequence samples. We consider the following first-order p-dimensional vector auto-regressive model:

    x^{t+1} = x^t A + \epsilon^{t+1},    (1)

where x^t \in \mathbb{R}^{1 \times p} is the state vector at time t, A \in \mathbb{R}^{p \times p} is the transition matrix, and \epsilon^t is a white-noise process with a time-invariant variance \sigma^2 I. Given a sequence sample, a common estimation method for A is the least-square estimator, whose properties have been studied extensively (see e.g., [7]). We assume that the process (1) is stable, i.e., the eigenvalues of A have modulus less than one. As a result, the process (1) has a stationary distribution, whose covariance Q is determined by the following discrete-time Lyapunov equation:

    A^\top Q A + \sigma^2 I = Q.    (2)

Linear quadratic Lyapunov theory (see e.g., [1]) gives that Q is uniquely determined if and only if \lambda_i(A)\lambda_j(A) \ne 1 for 1 \le i, j \le p, where \lambda_i(A) is the i-th eigenvalue of A. If the noise process \epsilon^t follows a normal distribution, the stationary distribution also follows a normal distribution, with covariance Q determined as above. Since our goal is to estimate A, a more relevant perspective is viewing (2) as a system of constraints on A. What motivates this work is that the estimation of Q requires only samples drawn from the stationary distribution rather than sequence data. However, even if we have the true Q and \sigma^2, we still cannot uniquely determine A because (2) is an under-determined system² of A. We thus rely on the few sequence samples to resolve the ambiguity. We describe the proposed methods in Section 2, and demonstrate their performance through experiments on synthetic and video data in Section 3. 
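As a concrete illustration of equation (2), the stationary covariance of a stable VAR model can be computed with an off-the-shelf discrete-time Lyapunov solver. The sketch below uses a hypothetical randomly generated transition matrix (not one from this paper) and assumes SciPy's `scipy.linalg.solve_discrete_lyapunov`:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)
p = 5
# A hypothetical stable transition matrix: rescale a Gaussian random matrix
# so its spectral radius is below one (the synthetic models in Section 3
# use the same construction).
M = rng.standard_normal((p, p))
A = 0.9 * M / np.max(np.abs(np.linalg.eigvals(M)))
sigma2 = 1.0

# Equation (2): A^T Q A + sigma^2 I = Q.  SciPy solves X = B X B^H + C,
# so passing B = A^T and C = sigma^2 I recovers the stationary covariance Q
# under the row-vector convention x^{t+1} = x^t A.
Q = solve_discrete_lyapunov(A.T, sigma2 * np.eye(p))

assert np.allclose(A.T @ Q @ A + sigma2 * np.eye(p), Q)
```

Note the transpose: it is what reconciles SciPy's column-vector convention with the row-vector dynamics used here.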
Our finding in short is that when the amount of sequence data is small and our VAR model assumption is valid, the proposed methods of incorporating non-sequence data into estimation significantly improve over standard methods, which use only the sequence data. We conclude this work and discuss future directions in Section 4.

2 Proposed Methods

Let \{x^i\}_{i=1}^T be a sequence of observations generated by the process (1). The standard least-square estimator for the transition matrix A is the solution to the following minimization problem:

    \min_A \|Y - XA\|_F^2,    (3)

where Y^\top := [(x^2)^\top (x^3)^\top \cdots (x^T)^\top], X^\top := [(x^1)^\top (x^2)^\top \cdots (x^{T-1})^\top], and \|\cdot\|_F denotes the matrix Frobenius norm. When p > T, which is often the case in modern time series modeling tasks, the least square problem (3) has multiple solutions all achieving zero squared error, and the resulting estimator overfits the data. A common remedy is adding a penalty term on A to (3) and minimizing the resulting regularized sum of squared errors. Usual penalty terms include the ridge penalty \|A\|_F^2 and the sparse penalty \|A\|_1 := \sum_{i,j} |A_{ij}|.

Now suppose we also have a set of non-sequence observations \{z_i\}_{i=1}^n drawn independently from the stationary distribution of (1). Note that we use superscripts for time indices and subscripts for data indices. As described in Section 1, the size n of the non-sequence sample can usually be much larger than the size T of the sequence data. To incorporate the non-sequence observations into the estimation procedure, we first obtain a covariance estimate \hat{Q} of the stationary distribution from the non-sequence sample, and then turn the Lyapunov equation (2) into a regularization term on A. More precisely, in addition to the usual ridge or sparse penalty terms, we also consider the following regularization:

    \|A^\top \hat{Q} A + \sigma^2 I - \hat{Q}\|_F^2,    (4)

which we refer to as the Lyapunov penalty. To compare (4) with the ridge penalty and the sparse penalty, we consider (3) as a multiple-response regression problem and view the i-th column of A as the regression coefficient vector for the i-th output dimension. From this viewpoint, we immediately see that both the ridge and the sparse penalizations treat the p regression problems as unrelated. On the contrary, the Lyapunov penalty incorporates relations between pairs of columns of A by using a covariance estimate \hat{Q}. In other words, although the non-sequence sample does not provide direct information about the individual regression problems, it does reveal how the regression problems are related to one another. To illustrate how the Lyapunov penalty may help to improve learning, we give an example in Figure 1. The true transition matrix is

    A = \begin{bmatrix} -0.4280 & 0.5723 \\ -1.0428 & -0.7144 \end{bmatrix}    (5)

and \epsilon^t \sim N(0, I). We generate a sequence of 4 points, draw a non-sequence sample of 20 points independently from the stationary distribution and obtain the sample covariance \hat{Q}. We fix the second column of A but vary the first, and plot in Figure 1(a) the resulting level sets of the sum of squared errors on the sequence (SSE) and the ridge penalty (Ridge), and in Figure 1(b) the level sets of the Lyapunov penalty (Lyap).

[Figure 1: Level sets of different functions in a bivariate AR example; panels: (a) SSE and Ridge, (b) Lyap, (c) SSE + Ridge + (1/2) Lyap.]

² If we further require A to be symmetric, (2) would be a simplified Continuous-time Algebraic Riccati Equation, which has a unique solution under some conditions (c.f. [1]).
We also give coordinates of the true [A_{11} A_{21}]^\top, the minima of SSE, Ridge, and Lyap, respectively. To see the behavior of the ridge regression, we trace out a path of the ridge regression solution by varying the penalization parameter, as indicated by the red-to-black curve in Figure 1(a). This path is pretty far from the true model, due to insufficient sequence data. For the Lyapunov penalty, we observe that it has two local minima, one of which is very close to the true model, while the other, also the global minimum, is very far. Thus, neither ridge regression nor the Lyapunov penalty can be used on its own to estimate the true model well. But as shown in Figure 1(c), the combined objective, SSE + Ridge + (1/2) Lyap, has its global minimum very close to the true model. This demonstrates how the ridge regression and the Lyapunov penalty may complement each other: the former by itself gives an inaccurate estimation of the true model, but is just enough to identify a good model from the many candidate local minima provided by the latter.

In the following we describe our proposed methods for incorporating the Lyapunov penalty (4) into ridge and sparse least-square estimation. We also discuss robust estimation for the covariance Q.

2.1 Ridge and Lyapunov penalty

Here we estimate A by solving the following problem:

    \min_A \frac{1}{2}\|Y - XA\|_F^2 + \frac{\lambda_1}{2}\|A\|_F^2 + \frac{\lambda_2}{4}\|A^\top \hat{Q} A + \sigma^2 I - \hat{Q}\|_F^2,    (6)

where \hat{Q} is a covariance estimate obtained from the non-sequence sample. We treat \lambda_1, \lambda_2 and \sigma^2 as hyperparameters and determine their values on a validation set. Given these hyperparameters, we solve (6) by gradient descent with back-tracking line search for the step size. The gradient of the objective function is given by

    -X^\top Y + X^\top X A + \lambda_1 A + \lambda_2 \hat{Q} A (A^\top \hat{Q} A + \sigma^2 I - \hat{Q}).    (7)

As mentioned before, (6) is a non-convex problem and thus requires good initialization. We use the following two initial estimates of A:

    \hat{A}_{ridge} := (X^\top X + \lambda_1 I)^{-1} X^\top Y   and   \hat{A}_{lsq} := (X^\top X)^\dagger X^\top Y,    (8)

where (\cdot)^\dagger denotes the Moore-Penrose pseudo inverse of a matrix, making \hat{A}_{lsq} the minimum-norm solution to the least square problem (3). We run the gradient descent algorithm with these two initial estimates, and choose the estimated A that gives a smaller objective.

2.2 Sparse and Lyapunov penalty

Sparse learning for vector auto-regressive models has become a useful tool in many modern time series modeling tasks, where the number p of states in the system is usually larger than the length T of the time series. For example, an important problem in computational biology is to understand the progression of certain biological processes from some measurements, such as temporal gene expression data.

Using an idea similar to (6), we estimate A by

    \min_A \frac{1}{2}\|Y - XA\|_F^2 + \frac{\lambda_2}{4}\|A^\top \hat{Q} A + \sigma^2 I - \hat{Q}\|_F^2   s.t. \|A\|_1 \le \lambda_1.    (9)

Instead of adding a sparse penalty on A to the objective function, we impose a constraint on the \ell_1 norm of A. Both the penalty and the constraint formulations have been considered in the sparse learning literature, and shown to be equivalent in the case of a convex objective. Here we choose the constraint formulation because it can be solved by a simple projected gradient descent method. On the contrary, the penalty formulation leads to a non-smooth and non-convex optimization problem, which is difficult to solve with standard methods for sparse learning. In particular, the soft-thresholding-based coordinate descent method for LASSO does not apply due to the Lyapunov regularization term. 
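For readers who want to experiment, the ridge-plus-Lyapunov objective (6) and its gradient (7) translate directly into NumPy. This is a minimal sketch under our own function names, not the authors' implementation, and it assumes a symmetric covariance estimate:

```python
import numpy as np

def lyap_objective(A, X, Y, Qhat, lam1, lam2, sigma2):
    # Objective (6): 1/2 ||Y - XA||_F^2 + lam1/2 ||A||_F^2
    #              + lam2/4 ||A^T Qhat A + sigma2 I - Qhat||_F^2
    p = A.shape[0]
    R = A.T @ Qhat @ A + sigma2 * np.eye(p) - Qhat
    return (0.5 * np.sum((Y - X @ A) ** 2)
            + 0.5 * lam1 * np.sum(A ** 2)
            + 0.25 * lam2 * np.sum(R ** 2))

def lyap_gradient(A, X, Y, Qhat, lam1, lam2, sigma2):
    # Gradient (7); exact when Qhat is symmetric:
    # -X^T Y + X^T X A + lam1 A + lam2 Qhat A (A^T Qhat A + sigma2 I - Qhat)
    p = A.shape[0]
    R = A.T @ Qhat @ A + sigma2 * np.eye(p) - Qhat
    return -X.T @ Y + X.T @ (X @ A) + lam1 * A + lam2 * Qhat @ A @ R
```

A finite-difference check against `lyap_objective` is an easy way to confirm the gradient before plugging it into a back-tracking line search.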
Moreover, most of the common methods for non-smooth optimization, such as bundle methods, solve convex problems and need non-trivial modification in order to handle non-convex problems [14].

Let J(A) denote the objective function in (9) and A^{(k)} denote the intermediate solution at the k-th iteration. Our projected gradient method updates A^{(k)} to A^{(k+1)} by the following rule:

    A^{(k+1)} \leftarrow \Pi(A^{(k)} - \eta^{(k)} \nabla J(A^{(k)})),    (10)

where \eta^{(k)} > 0 denotes a proper step size, \nabla J(A^{(k)}) denotes the gradient of J(\cdot) at A^{(k)}, and \Pi(\cdot) denotes the projection onto the feasible region \|A\|_1 \le \lambda_1. More precisely, for any p-by-p real matrix V we define

    \Pi(V) := \arg\min_{\|A\|_1 \le \lambda_1} \|A - V\|_F^2.    (11)

To compute the projection, we use the efficient \ell_1 projection technique given in Figure 2 of [5], whose expected running time is linear in the size of V.

For choosing a proper step size \eta^{(k)}, we consider the simple and effective Armijo rule along the projection arc described in [2]. This procedure is given in Algorithm 1, and the main idea is to ensure a sufficient decrease in the objective value per iteration (13). [2] proved that there always exists \eta^{(k)} = \beta^{r_k} > 0 satisfying (13), and every limit point of \{A^{(k)}\}_{k=0}^\infty is a stationary point of (9). In our experiments we set c = 0.01 and \beta = 0.1, both of which are typical values used in gradient descent. As in the previous section, we need good initializations for the projected gradient descent method. Here we use these two initial estimates:

    \hat{A}_{lsq'} := \arg\min_{\|A\|_1 \le \lambda_1} \|A - \hat{A}_{lsq}\|_F^2   and   \hat{A}_{sp} := \arg\min_{\|A\|_1 \le \lambda_1} \frac{1}{2}\|Y - XA\|_F^2,    (12)

where \hat{A}_{lsq} is defined in (8), and then choose the one that leads to a smaller objective value.

Algorithm 1: Armijo's rule along the projection arc
Input: A^{(k)}, \nabla J(A^{(k)}), 0 < \beta < 1, 0 < c < 1.
Output: A^{(k+1)}.
1: Find \eta^{(k)} = \max\{\beta^{r_k} \mid r_k \in \{0, 1, \ldots\}\} such that A^{(k+1)} := \Pi(A^{(k)} - \eta^{(k)} \nabla J(A^{(k)})) satisfies

    J(A^{(k+1)}) - J(A^{(k)}) \le c \,\mathrm{trace}\left(\nabla J(A^{(k)})^\top (A^{(k+1)} - A^{(k)})\right).    (13)

2.3 Robust estimation of covariance matrices

To obtain a good estimator for A using the proposed methods, we need a good estimator for the covariance of the stationary distribution of (1). Given an independent sample \{z_i\}_{i=1}^n drawn from the stationary distribution, the sample covariance is defined as

    S := \frac{1}{n-1} \sum_{i=1}^n (z_i - \bar{z})^\top (z_i - \bar{z}),   where   \bar{z} := \frac{1}{n} \sum_{i=1}^n z_i.    (14)

Although unbiased, the sample covariance is known to be vulnerable to outliers, and ill-conditioned when the number of sample points n is smaller than the dimension p. Both issues arise in many real world problems, and the latter is particularly common in gene expression analysis. Therefore, researchers in many fields, such as statistics [17, 20, 11], finance [10], signal processing [3, 4], and recently computational biology [15], have investigated robust estimators of covariances. Most of these results originate from the idea of shrinkage estimators, which shrink the covariance matrix towards some target covariance with a simple structure, such as a diagonal matrix. It has been shown in, e.g., [17, 10] that shrinking the sample covariance can achieve a smaller mean-squared error (MSE). 
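The projection (11) and the Armijo step of Algorithm 1 can be sketched as follows. This is an illustrative NumPy version: the sorting-based \ell_1 projection is an O(m log m) variant of the expected-linear-time method of [5], and `project_l1`/`projected_step` are our own names:

```python
import numpy as np

def project_l1(V, radius):
    # Euclidean projection of a matrix onto the l1-ball ||A||_1 <= radius,
    # via the standard sort-and-threshold simplex projection applied to |vec(V)|.
    v = V.ravel()
    if np.abs(v).sum() <= radius:
        return V.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)        # soft-threshold level
    w = np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
    return w.reshape(V.shape)

def projected_step(A, grad, J, radius, beta=0.1, c=0.01, max_tries=20):
    # One update (10) with the Armijo rule along the projection arc:
    # shrink the step by beta until the sufficient-decrease condition (13) holds.
    JA = J(A)
    eta = 1.0
    for _ in range(max_tries):
        A_new = project_l1(A - eta * grad, radius)
        if J(A_new) - JA <= c * np.sum(grad * (A_new - A)):  # trace form of (13)
            return A_new
        eta *= beta
    return A
```

The elementwise sum in the Armijo test is exactly the trace inner product in (13).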
More specifically, [10] considers the following linear shrinkage:

    \hat{Q} = (1 - \alpha) S + \alpha F    (15)

for 0 < \alpha < 1 and some target covariance F, and derives a formula for the optimal \alpha that minimizes the mean-squared error:

    \alpha^* := \arg\min_{0 \le \alpha \le 1} E(\|\hat{Q} - Q\|_F^2),    (16)

which involves unknown quantities such as true covariances of S. [15] proposed to estimate \alpha^* by replacing all the population quantities appearing in \alpha^* by their unbiased empirical estimates, and derived the resulting estimator \hat{\alpha}^* for several types of target F. For the experiments in this paper we use the estimator proposed in [15] with the following F:

    F_{ij} = \begin{cases} S_{ij}, & \text{if } i = j, \\ 0, & \text{otherwise,} \end{cases}   1 \le i, j \le p.    (17)

Denoting the sample correlation matrix as R, we give the final estimator \hat{Q} (Table 1 in [15]) below:

    \hat{Q}_{ij} := \begin{cases} S_{ij}, & \text{if } i = j, \\ \hat{R}_{ij}\sqrt{S_{ii} S_{jj}}, & \text{otherwise,} \end{cases}   \hat{R}_{ij} := \begin{cases} 1, & \text{if } i = j, \\ R_{ij} \min(1, \max(0, 1 - \hat{\alpha}^*)), & \text{otherwise,} \end{cases}    (18)

    \hat{\alpha}^* := \frac{\sum_{i \ne j} \widehat{\mathrm{Var}}(R_{ij})}{\sum_{i \ne j} R_{ij}^2} = \frac{\sum_{i \ne j} \frac{n}{(n-1)^3} \sum_{k=1}^n (w_{kij} - \bar{w}_{ij})^2}{\sum_{i \ne j} R_{ij}^2},    (19)

where

    w_{kij} := (\tilde{z}_k)_i (\tilde{z}_k)_j,   \bar{w}_{ij} := \frac{1}{n} \sum_{k=1}^n w_{kij},    (20)

and \{\tilde{z}_i\}_{i=1}^n are standardized non-sequence samples.

[Figure 2: Testing performances and eigenvalues in modulus for the dense model; panels (a)-(c) show the performance measures, and panel (d) the eigenvalues in modulus.]

3 Experiments

To evaluate the proposed methods, we conduct experiments on synthetic and video data. 
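The shrinkage estimator (18)-(20) above admits a compact NumPy sketch. This is our reading of the formulas rather than the authors' code; `shrunk_correlation_covariance` is a hypothetical name, and the dense (n, p, p) intermediate is only sensible for moderate p:

```python
import numpy as np

def shrunk_correlation_covariance(Z):
    # Shrinkage toward the diagonal target (17): keep the sample variances,
    # shrink the off-diagonal correlations by min(1, max(0, 1 - alpha-hat)).
    n, p = Z.shape
    S = np.cov(Z, rowvar=False)                  # sample covariance (14)
    sd = np.sqrt(np.diag(S))
    Ztil = (Z - Z.mean(axis=0)) / sd             # standardized samples z-tilde
    R = (Ztil.T @ Ztil) / (n - 1)                # sample correlation matrix
    W = Ztil[:, :, None] * Ztil[:, None, :]      # w_kij in (20), shape (n, p, p)
    var_R = n / (n - 1.0) ** 3 * ((W - W.mean(axis=0)) ** 2).sum(axis=0)
    off = ~np.eye(p, dtype=bool)
    alpha = var_R[off].sum() / (R[off] ** 2).sum()   # alpha-hat in (19)
    Rhat = R * min(1.0, max(0.0, 1.0 - alpha))       # shrunk correlations (18)
    np.fill_diagonal(Rhat, 1.0)
    return Rhat * np.outer(sd, sd)               # back to the covariance scale
```

By construction the diagonal entries equal the sample variances, and every off-diagonal entry is no larger in magnitude than its sample-covariance counterpart.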
In both sets of experiments we use the following two performance measures for a learnt model \hat{A}:

    Normalized error:  \frac{1}{T-1} \sum_{t=1}^{T-1} \frac{\|x^{t+1} - x^t \hat{A}\|^2}{\|x^{t+1} - x^t\|^2}.

    Cosine score:  \frac{1}{T-1} \sum_{t=1}^{T-1} \frac{\left|(x^{t+1} - x^t)^\top (x^t \hat{A} - x^t)\right|}{\|x^{t+1} - x^t\| \, \|x^t \hat{A} - x^t\|}.

To give an idea of how a good estimate \hat{A} would perform under these two measures, we point out that a constant prediction \hat{x}^{t+1} = x^t leads to a normalized error of 1, and a random-walk prediction \hat{x}^{t+1} = x^t + \epsilon^{t+1}, \epsilon^{t+1} being a white-noise process, results in a nearly-zero cosine score. Thus, when the true model is more than a simple random walk, a good estimate \hat{A} should achieve a normalized error much smaller than 1 and a cosine score way above 0. We also note that the cosine score is upper-bounded by 1. In experiments on synthetic data we have the true transition matrix A, so we consider a third criterion, the matrix error: \|\hat{A} - A\|_F / \|A\|_F.

In all our experiments, we have a training sequence, a testing sequence, and a non-sequence sample. To choose the hyper-parameters \lambda_1, \lambda_2 and \sigma^2, we split the training sequence into two halves and use the second half as the validation sequence. Once we find the best hyper-parameters according to the validation performance, we train a model on the full training sequence and predict on the testing sequence. For \lambda_1 and \lambda_2, we adopt the usual grid-search scheme with a suitable range of values. For \sigma^2, we observe that (2) implies \hat{Q} - \sigma^2 I should be positive semidefinite, and thus search the set \{0.9^j \min_i \lambda_i(\hat{Q}) \mid 1 \le j \le 3\}. 
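As a sketch, the two sequence-based performance measures above can be computed as follows (our own helper names; we take the absolute value of each per-step cosine, which matches the stated properties: the constant prediction scores exactly 1 on normalized error, and the cosine score is bounded by 1):

```python
import numpy as np

def normalized_error(X_seq, A_hat):
    # Mean of ||x^{t+1} - x^t A||^2 / ||x^{t+1} - x^t||^2 over the sequence;
    # the constant prediction A_hat = I scores exactly 1.
    num = np.sum((X_seq[1:] - X_seq[:-1] @ A_hat) ** 2, axis=1)
    den = np.sum((X_seq[1:] - X_seq[:-1]) ** 2, axis=1)
    return np.mean(num / den)

def cosine_score(X_seq, A_hat):
    # Mean absolute cosine between the observed change x^{t+1} - x^t and the
    # predicted change x^t A - x^t; upper-bounded by 1.
    d_true = X_seq[1:] - X_seq[:-1]
    d_pred = X_seq[:-1] @ A_hat - X_seq[:-1]
    num = np.abs(np.sum(d_true * d_pred, axis=1))
    den = np.linalg.norm(d_true, axis=1) * np.linalg.norm(d_pred, axis=1)
    return np.mean(num / den)
```

A noiseless sequence generated by the true A gives a cosine score of 1 and a normalized error of essentially 0, which is a quick sanity check on both helpers.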
In most of our experiments, we find that the proposed methods are much less sensitive to \sigma^2 than to \lambda_1 and \lambda_2.

3.1 Synthetic Data

We consider the following two VAR models with a Gaussian white noise process \epsilon^t \sim N(0, I).

    Dense model:  A = \frac{0.95 M}{\max_i |\lambda_i(M)|},   M_{ij} \sim N(0, 1),   1 \le i, j \le 200.

    Sparse model:  A = \frac{0.95 (M \odot B)}{\max_i |\lambda_i(M \odot B)|},   M_{ij} \sim N(0, 1),   B_{ij} \sim \mathrm{Bern}(1/8),   1 \le i, j \le 200,

where Bern(h) is the Bernoulli distribution with success probability h, and \odot denotes the entrywise product of two matrices. By setting h = 1/8, we make the sparse transition matrix A have roughly 40000/8 = 5000 non-zero entries. Both models are stable, and the stationary distribution for each model is a zero-mean Gaussian. We obtain the covariance Q of each stationary distribution by solving the Lyapunov equation (2). For a single experiment, we generate a training sequence and a testing sequence, both initialized from the stationary distribution, and draw a non-sequence sample independently from the stationary distribution. We set the length of the testing sequence to be 800, and vary the training sequence length T and the non-sequence sample size n: for the dense model, T \in \{50, 100, 150, 200, 300, 400, 600, 800\} and n \in \{50, 400, 1600\}; for the sparse model, T \in \{25, 75, 150, 400\} and n \in \{50, 400, 1600\}. Under each combination of T and n, we compare the proposed Lyapunov penalization method with the baseline approach of penalized least square, which uses only the sequence data. To investigate the limit of the proposed methods, we also use the true Q for the Lyapunov penalization.

[Figure 3: Testing performances and eigenvalues in modulus for the sparse model; panels (a)-(c) show the performance measures, and panel (d) the eigenvalues in modulus.]
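The two synthetic models can be generated as in the following sketch (our own helper names; the spectral rescaling makes each model stable with spectral radius 0.95, as specified above):

```python
import numpy as np

def dense_model(p=200, rho=0.95, rng=None):
    # Dense synthetic model: Gaussian matrix rescaled to spectral radius rho.
    rng = rng or np.random.default_rng(0)
    M = rng.standard_normal((p, p))
    return rho * M / np.max(np.abs(np.linalg.eigvals(M)))

def sparse_model(p=200, rho=0.95, h=1.0 / 8, rng=None):
    # Sparse synthetic model: entrywise Bernoulli(h) mask, so roughly p*p*h
    # non-zero entries, then the same spectral rescaling.
    rng = rng or np.random.default_rng(0)
    M = rng.standard_normal((p, p)) * (rng.random((p, p)) < h)
    return rho * M / np.max(np.abs(np.linalg.eigvals(M)))
```

With p = 200 and h = 1/8 this reproduces the roughly 5000 non-zero entries quoted above.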
We run 10 such experiments for the dense model and 5 for the sparse model, and report the overall performances of both the proposed and the baseline methods.

3.1.1 Experimental results for the dense model

We give boxplots of the three performance measures in the 10 experiments in Figures 2(a) to 2(c). The ridge regression approach and the proposed Lyapunov penalization method (6) are abbreviated as Ridge and Lyap, respectively. For normalized error and cosine score, we also report the performance of the true A on testing sequences.

We observe that Lyap improves over Ridge more significantly when the training sequence length T is small (\le 200) and the non-sequence sample size n is large (\ge 400). When T is large, Ridge already performs quite well and Lyap does not improve the performance much. But with the true stationary covariance Q, Lyap outperforms Ridge significantly for all T. When n is small, the covariance estimate \hat{Q} is far from the true Q and the Lyapunov penalty does not provide useful information about A. In this case, the value of \lambda_2 determined by the validation performance is usually quite small (0.5 or 1) compared to \lambda_1 (256), so the two methods perform similarly on testing sequences. We note that if instead of the robust covariance estimate in (18) and (19) we use the sample covariance, the performance of Lyap can be marginally worse than Ridge when n is small. A precise statement on how the estimation error in Q affects \hat{A} is worth studying in the future. As a qualitative assessment of the estimated transition matrices, in Figure 2(d) we plot the eigenvalues in modulus of the true A and the \hat{A}'s obtained by different methods when T = 50 and n = 1600. The eigenvalues are sorted according to their modulus. 
Both Ridge and Lyap severely under-estimate the eigenvalues in modulus, but Lyap preserves the spectrum much better than Ridge.

3.1.2 Experimental results for the sparse model

We give boxplots of the performance measures in the 5 experiments in Figures 3(a) to 3(c), and the eigenvalues in modulus of the true A and some \hat{A}'s in Figure 3(d). The sparse least-square method and the proposed method (9) are abbreviated as Sparse and Lyap, respectively.

We observe the same type of improvement as in the dense model: Lyap improves over Sparse more significantly when T is small and n is large. But the largest improvement occurs when T = 75, not the shortest training sequence length T = 25. A major difference lies in the impact of the Lyapunov penalization on the spectrum of \hat{A}, as revealed in Figure 3(d). When T is as small as 25, the sparse least-square method shrinks all the eigenvalues but still keeps most of them non-zero, while Lyap with a non-sequence sample of size 1600 over-estimates the first few largest eigenvalues in modulus but shrinks the rest to have very small modulus. In contrast, Lyap with the true Q preserves the spectrum much better. We may thus need an even better covariance estimate for the sparse model.

[Figure 4: Results on the pendulum video data; panels: (a) the pendulum, (b) normalized error, (c) cosine score, comparing Ridge and Lyap for T \in \{6, 10, 20, 50\}.]

3.2 Video Data

We test our methods using a video sequence of a periodically swinging pendulum³, which consists of 500 frames of 75-by-80 grayscale images. One such frame is given in Figure 4(a). The period is about 23 frames. 
To further reduce the dimension we take the second-level Gaussian pyramids, resulting in images of size 9-by-11. We then treat each reduced image as a 99-dimensional vector, and normalize each dimension to be zero-mean and standard deviation 1. We analyze this sequence with a 99-dimensional first-order VAR model. To check whether a VAR model is a suitable choice, we estimate a transition matrix from the first 400 frames by ridge regression while choosing the penalization parameter on the next 50 frames, and predict on the last 50 frames. The best penalization parameter is 0.0156, and the testing normalized error and cosine score are 0.33 and 0.97, respectively, suggesting that the dynamics of the video sequence is well-captured by a VAR model.

We compare the proposed method (6) with the ridge regression for four lengths of the training sequence, T \in \{6, 10, 20, 50\}, and treat the last 50 frames as the testing sequence. For both methods, we split the training sequence into two halves and use the second half as a validation sequence. For the proposed method, we simulate a non-sequence sample by randomly choosing 300 frames from between the (T + 1)-st frame and the 450-th frame without replacement. We repeat this 10 times. The testing normalized errors and cosine scores of both methods are given in Figures 4(b) and 4(c). For the proposed method, we report the mean performance measures over the 10 simulated non-sequence samples with standard deviation. When T \le 20, which is close to the period, the proposed method outperforms ridge regression very significantly, except when T = 10, where the cosine score of Lyap is barely better than Ridge. However, when we increase T to 50, the difference between the two methods vanishes, even though there is still much room for improvement as indicated by the result of our model sanity check before. 
This may be due to our use of dependent data as the non-sequence sample, or simply insufficient non-sequence data. As for \lambda_1 and \lambda_2, their values decrease respectively from 512 and 2,048 to less than 32 as T increases, but since we fix the amount of non-sequence data, the interaction between their value changes is less clear than on the synthetic data.

4 Conclusion

We propose to improve penalized least-square estimation of VAR models by incorporating non-sequence data, which are assumed to be samples drawn from the stationary distribution of the underlying VAR model. We construct a novel penalization term based on the discrete-time Lyapunov equation concerning the covariance (estimate) of the stationary distribution. Preliminary experimental results demonstrate that our methods can improve significantly over standard penalized least-square methods when there are only few sequence data but abundant non-sequence data and when the model assumption is valid. In the future, we would like to investigate the impact of \hat{Q} on \hat{A} in a precise manner. Also, we may consider noise processes \epsilon^t with more general covariances, and incorporate the noise covariance estimation into the proposed Lyapunov penalization scheme. Finally, and most importantly, we aim to apply the proposed methods to real scientific time series data and provide a more effective tool for those modelling tasks.

³ A similar video sequence has been used in [16].

References

[1] P. Antsaklis and A. Michel. Linear Systems. Birkhauser, 2005.
[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, second edition, 1999.
[3] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero. Shrinkage algorithms for MMSE covariance estimation. IEEE Transactions on Signal Processing, 58:5016–5029, 2010.
[4] Y. Chen, A. Wiesel, and A. O. Hero. 
Robust shrinkage estimation of high-dimensional covariance matrices. Technical report, arXiv:1009.5331v1 [stat.ME], September 2010.
[5] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the \ell_1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
[6] A. Gupta and Z. Bar-Joseph. Extracting dynamics from static cancer expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5:172–182, 2008.
[7] J. Hamilton. Time Series Analysis. Princeton University Press, 1994.
[8] T.-K. Huang and J. Schneider. Learning linear dynamical systems without sequence information. In Proceedings of the 26th International Conference on Machine Learning, pages 425–432, 2009.
[9] T.-K. Huang, L. Song, and J. Schneider. Learning nonlinear dynamic models from non-sequenced data. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
[10] O. Ledoit and M. Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10:603–621, 2003.
[11] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88:365–411, 2004.
[12] A. Lozano, N. Abe, Y. Liu, and S. Rosset. Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics, 25(12):i110, 2009.
[13] T. C. Mills. The Econometric Modelling of Financial Time Series. Cambridge University Press, second edition, 1999.
[14] D. Noll, O. Prot, and A. Rondepierre. A proximity control algorithm to minimize nonsmooth and nonconvex functions. Pacific Journal of Optimization, 4(3):569–602, 2008.
[15] J. Schäfer and K. Strimmer. 
A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, 2005.
[16] S. M. Siddiqi, B. Boots, and G. J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
[17] C. Stein. Estimation of a covariance matrix. In Rietz Lecture, 39th Annual Meeting, Atlanta, GA, 1975.
[18] R. S. Tsay. Analysis of Financial Time Series. Wiley-Interscience, 2005.
[19] B. P. Tu, A. Kudlicki, M. Rowicka, and S. L. McKnight. Logic of the yeast metabolic cycle: Temporal compartmentalization of cellular processes. Science, 310(5751):1152–1158, 2005.
[20] R. Yang and J. O. Berger. Estimation of a covariance matrix using the reference prior. Annals of Statistics, 22:1195–1211, 1994.
", "award": [], "sourceid": 882, "authors": [{"given_name": "Tzu-kuo", "family_name": "Huang", "institution": null}, {"given_name": "Jeff", "family_name": "Schneider", "institution": null}]}