{"title": "Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 847, "page_last": 855, "abstract": "Time series prediction problems are becoming increasingly high-dimensional in modern applications, such as climatology and demand forecasting. For example, in the latter problem, the number of items for which demand needs to be forecast might be as large as 50,000. In addition, the data is generally noisy and full of missing values. Thus, modern applications require methods that are highly scalable, and can deal with noisy data in terms of corruptions or missing values. However, classical time series methods usually fall short of handling these issues. In this paper, we present a temporal regularized matrix factorization (TRMF) framework which supports data-driven temporal learning and forecasting. We develop novel regularization schemes and use scalable matrix factorization methods that are eminently suited for high-dimensional time series data that has many missing values. Our proposed TRMF is highly general, and subsumes many existing approaches for time series analysis. We make interesting connections to graph regularization methods in the context of learning the dependencies in an autoregressive framework. Experimental results show the superiority of TRMF in terms of scalability and prediction quality. In particular, TRMF is two orders of magnitude faster than other methods on a problem of dimension 50,000, and generates better forecasts on real-world datasets such as Wal-mart E-commerce datasets.", "full_text": "Temporal Regularized Matrix Factorization for\n\nHigh-dimensional Time Series Prediction\n\nHsiang-Fu Yu\n\nUniversity of Texas at Austin\nrofuyu@cs.utexas.edu\n\nNikhil Rao\n\nTechnicolor Research\n\nnikhilrao86@gmail.com\n\nInderjit S. 
Dhillon\n\nUniversity of Texas at Austin\n\ninderjit@cs.utexas.edu\n\nAbstract\n\nTime series prediction problems are becoming increasingly high-dimensional in\nmodern applications, such as climatology and demand forecasting. For example,\nin the latter problem, the number of items for which demand needs to be forecast\nmight be as large as 50,000. In addition, the data is generally noisy and full of\nmissing values. Thus, modern applications require methods that are highly scalable,\nand can deal with noisy data in terms of corruptions or missing values. However,\nclassical time series methods usually fall short of handling these issues. In this\npaper, we present a temporal regularized matrix factorization (TRMF) framework\nwhich supports data-driven temporal learning and forecasting. We develop novel\nregularization schemes and use scalable matrix factorization methods that are\neminently suited for high-dimensional time series data that has many missing values.\nOur proposed TRMF is highly general, and subsumes many existing approaches\nfor time series analysis. We make interesting connections to graph regularization\nmethods in the context of learning the dependencies in an autoregressive framework.\nExperimental results show the superiority of TRMF in terms of scalability and\nprediction quality. In particular, TRMF is two orders of magnitude faster than\nother methods on a problem of dimension 50,000, and generates better forecasts on\nreal-world datasets such as Wal-mart E-commerce datasets.\n\nIntroduction\n\n1\nTime series analysis is a central problem in many applications such as demand forecasting and\nclimatology. Often, such applications require methods that are highly scalable to handle a very large\nnumber (n) of possibly inter-dependent one-dimensional time series and/or have a large time frame\n(T ). For example, climatology applications involve data collected from possibly thousands of sensors,\nevery hour (or less) over several years. 
Similarly, a store tracking its inventory would track thousands of items every day for multiple years. Not only is the scale of such problems huge, but they might also involve missing values, due to sensor malfunctions, occlusions, or simple human error. Thus, modern time series applications present two challenges to practitioners: scalability to handle large n and T, and the flexibility to handle missing values.

Most approaches in the traditional time series literature, such as autoregressive (AR) models or dynamic linear models (DLM) [7, 21], focus on low-dimensional time-series data and fall short of handling the two aforementioned issues. For example, an AR model of order L requires O(T L²n⁴ + L³n⁶) time to estimate O(Ln²) parameters, which is prohibitive even for moderate values of n. Similarly, Kalman-filter-based DLM approaches need O(kn²T + k³T) computation to update parameters, where k is the latent dimensionality, which is usually chosen to be larger than n in many situations [13]. As a specific example, the maximum likelihood estimator implementation in the widely used R-DLM package [12], which relies on a general optimization solver, cannot scale beyond n in the tens (see Appendix D for details). On the other hand, for models such as AR, the flexibility to handle missing values can be very challenging even for one-dimensional time series [1], let alone the difficulty of handling high-dimensional time series.

A natural way to model high-dimensional time series data is in the form of a matrix, with rows corresponding to each one-dimensional time series and columns corresponding to time points.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
In light of the observation that the n time series are usually highly correlated with each other, there have been attempts to apply low-rank matrix factorization (MF) or matrix completion (MC) techniques to analyze high-dimensional time series [2, 14, 16, 23, 26]. Unlike the AR and DLM models above, state-of-the-art MF methods scale linearly in n, and hence can handle large datasets. Let Y ∈ R^{n×T} be the matrix of the observed n-dimensional time series, with Y_it the observation at the t-th time point of the i-th time series. Under the standard MF approach, Y_it is estimated by the inner product f_i^T x_t, where f_i ∈ R^k is a k-dimensional latent embedding for the i-th time series, and x_t ∈ R^k is a k-dimensional latent temporal embedding for the t-th time point. We can stack the x_t as the columns of a matrix X ∈ R^{k×T} and the f_i^T as the rows of F ∈ R^{n×k} (Figure 1) to get Y ≈ F X. We can then solve:

  min_{F,X}  Σ_{(i,t)∈Ω} (Y_it − f_i^T x_t)² + λ_f R_f(F) + λ_x R_x(X),   (1)

where Ω is the set of observed entries, and R_f(F), R_x(X) are regularizers for F and X, which usually serve to avoid overfitting and/or to encourage specific temporal structure among the embeddings.

Figure 1: Matrix factorization model for multiple time series. F captures features for each time series in the matrix Y, and X captures the latent and time-varying variables.

It is clear that the common choice of regularizer R_x(X) = ‖X‖_F is no longer appropriate for time series applications, as it does not take into account the ordering among the temporal embeddings {x_t}. Most existing MF approaches [2, 14, 16, 23, 26] adapt graph-based approaches to handle temporal dependencies.
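As a concrete illustration, the masked objective in (1) with the standard (temporally agnostic) Frobenius-norm regularizers can be sketched as follows; the function name and the dense loop over observed entries are for exposition only, not the authors' implementation:

```python
import numpy as np

def mf_objective(Y, F, X, omega, lam_f, lam_x):
    """Objective (1): squared error over the observed entries Omega, plus
    squared-Frobenius regularizers on F and X (the standard choice, which
    ignores the temporal ordering of the columns of X)."""
    err = 0.0
    for (i, t) in omega:
        err += (Y[i, t] - F[i] @ X[:, t]) ** 2
    return err + lam_f * np.sum(F ** 2) + lam_x * np.sum(X ** 2)
```

Since the loss only touches entries in Ω, missing values are handled by simply leaving them out of the set.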
Specifically, the dependencies are described by a weighted similarity graph and incorporated through a Laplacian regularizer [18]. However, graph-based regularization fails in cases where there are negative correlations between two time points. Furthermore, unlike scenarios where explicit graph information is available with the data (such as a social network or a product co-purchasing graph for recommender systems), the explicit temporal dependency structure is usually unavailable and has to be inferred or approximated, which forces practitioners either to perform a separate procedure to estimate the dependencies or to consider only very short-term dependencies with simple fixed weights. Moreover, existing MF approaches, while yielding good estimates for missing values at past points, are poor at forecasting future values, which is the problem of interest in time series analysis.

In this paper, we propose a novel temporal regularized matrix factorization (TRMF) framework for high-dimensional time series analysis. In TRMF, we take a principled approach to describe the structure of temporal dependencies among the latent temporal embeddings {x_t} and design a temporal regularizer to incorporate this structure into the standard MF formulation. Unlike most existing MF approaches, our TRMF method supports data-driven temporal dependency learning and also brings the ability to forecast future values to a matrix factorization approach. In addition, inheriting the properties of MF approaches, TRMF easily handles high-dimensional time series data even in the presence of many missing values. As a specific example, we demonstrate a novel autoregressive temporal regularizer which encourages AR structure among the temporal embeddings {x_t}. We also make connections between the proposed regularization framework and graph-based approaches [18], where even negative correlations can be accounted for. This connection not only leads to a better understanding of the dependency structure incorporated by our framework, but also brings the benefit of applying off-the-shelf efficient solvers such as GRALS [15] directly to TRMF.

Paper Organization. In Section 2, we review existing approaches and their limitations on data with temporal dependencies. We present the proposed TRMF framework in Section 3, and show that the method is highly general and can be used for a variety of time series applications. We introduce a novel AR temporal regularizer in Section 4, and make connections to graph-based regularization approaches. We demonstrate the superiority of the proposed approach via extensive experimental results in Section 5, and conclude in Section 6.

2 Motivations: Existing Approaches and Limitations
2.1 Classical Time-Series Models
Models such as AR and DLM are not suitable for modern high-dimensional time series data (i.e., both n and T large) due to their inherent computational inefficiency (see Section 1). To avoid overfitting in AR models, there have been studies of various structured transition matrices, such as low-rank and sparse matrices [5, 10, 11]. The focus of this research has been on obtaining better statistical guarantees; the scalability issue of AR models remains open. On the other hand, it is also challenging for many classical time-series models to deal with data that has many missing values [1]. In many situations where the model parameters are either given or designed by practitioners, the Kalman filter is used to perform forecasting, while Kalman smoothing is used to impute missing entries. When the model parameters are unknown, EM algorithms are applied to estimate both the model parameters and the latent embeddings for DLM [3, 8, 9, 17, 19].
As most EM approaches for DLM contain the Kalman filter as a building block, they cannot scale to very high-dimensional time series data. Indeed, as shown in Section 5, the popular R package for DLMs does not scale beyond data with tens of dimensions.

2.2 Existing Matrix Factorization Approaches for Data with Temporal Dependencies
In standard MF (1), the squared Frobenius norm R_x(X) = ‖X‖²_F = Σ_{t=1}^{T} ‖x_t‖² is usually the regularizer of choice for X. Because the squared Frobenius norm assumes no dependencies among {x_t}, the standard MF formulation is invariant to column permutations and not applicable to data with temporal dependencies. Hence most existing temporal MF approaches turn to the framework of graph-based regularization [18] for temporally dependent {x_t}, with a graph encoding the temporal dependencies. An exception is the work in [22], where the authors use specially designed regularizers to encourage a log-normal structure on the temporal coefficients.

Graph regularization for temporal dependencies: The framework of graph-based regularization is an approach to describe and incorporate general dependencies among variables. Let G be a graph over {x_t} and G_ts be the edge weight between the t-th node and the s-th node. A popular regularizer to include as part of an objective function is the following:

  R_x(X) = G(X | G, η) := (1/2) Σ_{t∼s} G_ts ‖x_t − x_s‖² + (η/2) Σ_t ‖x_t‖²,   (2)

where t ∼ s denotes an edge between the t-th and s-th nodes, and the second summation guarantees strong convexity. A large G_ts ensures that x_t and x_s are close to each other in Euclidean distance when (2) is minimized. Note that to guarantee the convexity of G(X | G, η), we need G_ts ≥ 0.

To apply graph-based regularizers to temporal dependencies, we need to specify the (repeating) dependency pattern by a lag set L and a weight vector w such that all edges t ∼ s of distance l (i.e., |s − t| = l) share the same weight G_ts = w_l. See Figure 2 for an example with L = {1, 4}. Given L and w, the corresponding graph regularizer becomes

  G(X | G, η) = (1/2) Σ_{l∈L} Σ_{t: t>l} w_l ‖x_t − x_{t−l}‖² + (η/2) Σ_t ‖x_t‖².   (3)

Figure 2: Graph-based regularization for temporal dependencies.

This direct use of the graph-based approach, while intuitive, has two issues: (a) there might be negatively correlated dependencies between two time points; and (b) unlike many applications where such regularizers are used, the explicit temporal dependency structure is usually not available and has to be inferred. As a result, most existing approaches consider only very simple temporal dependencies, such as a small lag set L (e.g., L = {1}) and/or uniform weights (e.g., w_l = 1 for all l ∈ L). For example, a simple chain graph is used to design the smoothing regularizer in TCF [23]. This leads to the poor forecasting abilities of existing MF methods for large-scale time series applications.

2.3 Challenges in Learning Temporal Dependencies
One could try to learn the weights w_l automatically, by using the same regularizer as in (3) but with the weights unknown.
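For reference, the lag-based graph regularizer (3) can be sketched directly; the dictionary-based weight lookup below is an illustrative choice, not part of the paper:

```python
import numpy as np

def graph_temporal_reg(X, lags, w, eta):
    """Regularizer (3): (1/2) sum_{l in L} sum_{t > l} w_l ||x_t - x_{t-l}||^2
    plus the strong-convexity term (eta/2) sum_t ||x_t||^2.
    X is k x T; w maps each lag l in `lags` to its (nonnegative) weight w_l."""
    k, T = X.shape
    reg = 0.0
    for l in lags:
        for t in range(l, T):  # 0-based: pair column t with column t - l
            reg += 0.5 * w[l] * np.sum((X[:, t] - X[:, t - l]) ** 2)
    reg += 0.5 * eta * np.sum(X ** 2)
    return reg
```

Note that the first term vanishes when all w_l = 0, which is exactly the degenerate behavior discussed next.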
This would lead to the following optimization problem:

  min_{F,X,w≥0}  Σ_{(i,t)∈Ω} (Y_it − f_i^T x_t)² + λ_f R_f(F) + (λ_x/2) Σ_{l∈L} Σ_{t: t−l>0} w_l ‖x_t − x_{t−l}‖² + (λ_x η/2) Σ_t ‖x_t‖²,   (4)

where 0 is the zero vector, and w ≥ 0 is the constraint imposed by graph regularization.

It is not hard to see that the above optimization yields the trivial all-zero solution for w*, meaning the objective is minimized when no temporal dependencies exist! To avoid the all-zero solution, one might impose a simplex constraint on w (i.e., Σ_{l∈L} w_l = 1). Again, it is not hard to see that this results in w* being a 1-sparse vector with w_{l*} = 1, where l* = argmin_{l∈L} Σ_{t: t>l} ‖x_t − x_{t−l}‖². Thus, learning the weights automatically by simply plugging the regularizer into the MF formulation is not a viable option.

3 Temporal Regularized Matrix Factorization
To resolve the limitations discussed in Sections 2.2 and 2.3, we propose the Temporal Regularized Matrix Factorization (TRMF) framework, a novel approach to incorporate temporal dependencies into matrix factorization models. Unlike the aforementioned graph-based approaches, we propose to use well-studied time series models to describe the temporal dependencies among {x_t} explicitly. Such models take the form:

  x_t = M_Θ({x_{t−l} : l ∈ L}) + ε_t,   (5)

where ε_t is a Gaussian noise vector, and M_Θ is the time-series model parameterized by L and Θ. Here L is a set of lag indices l, each denoting a dependency between the t-th and (t−l)-th time points, while Θ captures the weighting information of the temporal dependencies (such as the transition matrix in AR models).
To incorporate the temporal dependency into the standard MF formulation (1), we propose to design a new regularizer T_M(X | Θ) which encourages the structure induced by M_Θ. Taking a standard approach to modeling time series, we set T_M(X | Θ) to be the negative log-likelihood of observing a particular realization of {x_t} under the model M_Θ:

  T_M(X | Θ) = −log P(x_1, . . . , x_T | Θ).   (6)

When Θ is given, we can use R_x(X) = T_M(X | Θ) in the MF formulation (1) to encourage {x_t} to follow the temporal dependency induced by M_Θ. When Θ is unknown, we can treat Θ as another set of variables and include another regularizer R_θ(Θ) in (1):

  min_{F,X,Θ}  Σ_{(i,t)∈Ω} (Y_it − f_i^T x_t)² + λ_f R_f(F) + λ_x T_M(X | Θ) + λ_θ R_θ(Θ),   (7)

which can be solved by an alternating minimization procedure over F, X, and Θ.

Data-driven temporal dependency learning in TRMF: Recall that in Section 2.3 we showed that directly using graph-based regularizers to incorporate temporal dependencies leads to trivial solutions for the weights. TRMF circumvents this issue. When F and X are fixed, (7) reduces to:

  min_Θ  λ_x T_M(X | Θ) + λ_θ R_θ(Θ),   (8)

which is a maximum-a-posteriori (MAP) estimation problem (in the Bayesian sense) for the best Θ given {x_t} under the model M_Θ. There are well-developed algorithms to solve (8) and obtain non-trivial Θ. Thus, unlike most existing temporal matrix factorization approaches where the strength of the dependencies is fixed, Θ in TRMF can be learned automatically from the data.

Time series analysis with TRMF: TRMF (7) lends itself seamlessly to a variety of commonly encountered tasks in analyzing data with temporal dependencies:
• Time-series forecasting: Once we have M_Θ for the latent embeddings {x_t : t = 1, . . .
, T}, we can use it to predict future latent embeddings {x_t : t > T}, and hence obtain non-trivial forecasts y_t = F x_t for t > T.
• Missing-value imputation: In some time-series applications, some entries of Y might be unobserved, for example due to faulty sensors in electricity-usage monitoring, or occlusions in the case of motion recognition in video. We can use f_i^T x_t to impute these missing entries, much as in standard matrix completion; this is useful in recommender systems [23] and sensor networks [26].

Extensions to incorporate extra information: Like matrix factorization, TRMF (7) can be extended to incorporate additional information. For example, pairwise relationships between the time series can be incorporated using structural regularizers on F. Furthermore, when features are known for the time series, we can make use of interaction models such as those in [6, 24, 25]. TRMF can also be extended to tensors. More details on these extensions can be found in Appendix B.

4 A Novel Autoregressive Temporal Regularizer
In Section 3, we described the TRMF framework in a very general sense, with the regularizer T_M(X | Θ) incorporating dependencies specified by the time series model M_Θ. In this section, we specialize this to the case of AR models, which are parameterized by a lag set L and weights W = {W^(l) ∈ R^{k×k} : l ∈ L}. Assume that x_t is a noisy linear combination of some previous points; that is, x_t = Σ_{l∈L} W^(l) x_{t−l} + ε_t, where ε_t is a Gaussian noise vector. For simplicity, we assume that ε_t ∼ N(0, σ²I_k), where I_k is the k × k identity matrix.¹ The temporal regularizer T_M(X | Θ) corresponding to this AR model can be written as:

  T_AR(X | L, W, η) := (1/2) Σ_{t=m}^{T} ‖x_t − Σ_{l∈L} W^(l) x_{t−l}‖² + (η/2) Σ_t ‖x_t‖²,   (9)

where m := 1 + L_max, L_max := max(L), and η > 0 guarantees the strong convexity of (9).

TRMF allows us to learn the weights W^(l) when they are unknown. Since each W^(l) ∈ R^{k×k}, there are |L|k² variables to learn, which may lead to overfitting. To prevent this and to yield more interpretable results, we consider diagonal W^(l), reducing the number of parameters to |L|k. To simplify notation, we use W to denote the k × L_max matrix whose l-th column holds the diagonal elements of W^(l); for l ∉ L, the l-th column of W is the zero vector. Let x̄_r^T = [· · · , X_rt, · · ·] be the r-th row of X and w̄_r^T = [· · · , W_rl, · · ·] be the r-th row of W. Then (9) can be written as T_AR(X | L, W, η) = Σ_{r=1}^{k} T_AR(x̄_r | L, w̄_r, η), where we define

  T_AR(x̄ | L, w̄, η) = (1/2) Σ_{t=m}^{T} (x_t − Σ_{l∈L} w_l x_{t−l})² + (η/2) ‖x̄‖²,   (10)

with x_t the t-th element of x̄, and w_l the l-th element of w̄.

Correlations among Multiple Time Series. Even when W^(l) is diagonal, TRMF retains the power to capture correlations among the time series via the factors {f_i}, since W^(l) affects only the structure of the latent embeddings {x_t}. Indeed, as the i-th dimension of {y_t} is modeled by f_i^T X in (7), the low-rank F is a k-dimensional latent embedding of the multiple time series, and this embedding captures the correlations among them. Furthermore, {f_i} acts as a set of time series features, which can be used to perform classification/clustering even in the presence of missing values.

Choice of Lag Index Set L. Unlike most approaches mentioned in Section 2.2, the choice of L in TRMF is more flexible.
Thus, TRMF provides important advantages: First, because there is no need to specify the weight parameters W, L can be chosen larger to account for long-range dependencies, which also yields more accurate and robust forecasts. Second, the indices in L can be discontinuous, so one can easily embed domain knowledge about periodicity or seasonality. For example, one might consider L = {1, 2, 3, 51, 52, 53} for weekly data with a one-year seasonality.

Connections to Graph Regularization. We now establish connections between T_AR(x̄ | L, w̄, η) and the graph regularization (2) for matrix factorization. Let L̄ := L ∪ {0} with w_0 = −1, so that (10) becomes

  T_AR(x̄ | L, w̄, η) = (1/2) Σ_{t=m}^{T} (Σ_{l∈L̄} w_l x_{t−l})² + (η/2) ‖x̄‖²,

and let Δ(d) := {l ∈ L̄ : l − d ∈ L̄}. We then have the following result:

Theorem 1. Given a lag index set L, a weight vector w̄ ∈ R^{L_max}, and x̄ ∈ R^T, there is a weighted signed graph G^AR with T nodes and a diagonal matrix D ∈ R^{T×T} such that

  T_AR(x̄ | L, w̄, η) = G(x̄ | G^AR, η) + (1/2) x̄^T D x̄,   (11)

where G(x̄ | G^AR, η) is the graph regularization (2) with G = G^AR. Furthermore, for all t and d,

  G^AR_{t,t+d} = −Σ_{l∈Δ(d)} w_l w_{l−d} [m ≤ t + l ≤ T]  if Δ(d) ≠ ∅,  and 0 otherwise,

and  D_tt = (Σ_{l∈L̄} w_l)(Σ_{l∈L̄} w_l [m ≤ t + l ≤ T]).

See Appendix C.1 for a detailed proof. From Theorem 1, we see that Δ(d) is non-empty if and only if there are edges between time points separated by d in G^AR. Thus, we can construct the dependency graph for T_AR(x̄ | L, w̄, η) by checking whether Δ(d) is empty. Figure 3 demonstrates an example with L = {1, 4}. We can see that besides edges of distance d = 1 and d = 4, there are also edges of distance d = 3 (dotted edges in Figure 3), because 4 − 3 ∈ L̄ and Δ(3) = {4}.

Figure 3: The graph structure induced by the AR temporal regularizer (10) with L = {1, 4}.

¹ If the (known) covariance matrix is not the identity, we can suitably modify the regularizer.

Table 1: Data statistics.
                 synthetic   electricity   traffic   walmart-1   walmart-2
  n                     16           370       963       1,350       1,582
  T                    128        26,304    10,560         187         187
  missing ratio         0%            0%        0%       55.3%       49.3%

Although Theorem 1 shows that AR-based regularizers are similar to the graph-based regularization framework, we note the following key differences:
• The graph G^AR in Theorem 1 contains both positive and negative edges. This implies that the AR temporal regularizer is able to support negative correlations, which the standard graph-based regularizer cannot. This can make G(x̄ | G^AR, η) non-convex; the addition of the second term in (11), however, still yields a convex regularizer T_AR(x̄ | L, w̄, η).
• Unlike (3), where there is freedom to specify a weight for each distance, the edge weights in the graph G^AR are more structured (e.g., the weight for d = 3 in Figure 3 is w_1 w_4). Hence, minimization with respect to the w's is not trivial, and neither are the obtained solutions.

Plugging T_M(X | Θ) = T_AR(X | L, W, η) into (7), we obtain the following problem:

  min_{F,X,W}  Σ_{(i,t)∈Ω} (Y_it − f_i^T x_t)² + λ_f R_f(F) + λ_x Σ_{r=1}^{k} T_AR(x̄_r | L, w̄_r, η) + λ_w R_w(W),   (12)

where R_w(W) is a regularizer for W. We refer to (12) as TRMF-AR. We can apply alternating minimization to solve (12).
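The edge pattern promised by Theorem 1 can be enumerated mechanically from the lag set alone; the helper below is a sketch (names are ours, not the paper's):

```python
def induced_distances(lags):
    """Distances d at which the AR regularizer induces graph edges
    (Theorem 1): Delta(d) = { l in Lbar : l - d in Lbar }, where
    Lbar = lags together with 0; d carries an edge iff Delta(d) != {}."""
    lbar = set(lags) | {0}
    out = {}
    for d in range(1, max(lags) + 1):
        delta = {l for l in lbar if l - d in lbar}
        if delta:
            out[d] = delta
    return out
```

For L = {1, 4} this reproduces the example in Figure 3: edges at distances 1 and 4, plus the extra distance-3 edges coming from Δ(3) = {4}.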
In fact, solving for each variable reduces to a well-known problem for which highly efficient algorithms exist:

Updates for F. When X and W are fixed, the subproblem for F is the same as updating F with X fixed in (1). Thus, fast algorithms such as alternating least squares or coordinate descent can be applied directly to find F, at a cost of O(|Ω|k²) time.

Updates for X. We solve argmin_X Σ_{(i,t)∈Ω} (Y_it − f_i^T x_t)² + λ_x Σ_{r=1}^{k} T_AR(x̄_r | L, w̄_r, η). From Theorem 1, T_AR(x̄ | L, w̄, η) has the same form as the graph regularizer, so we can apply GRALS [15] to find X, at a cost of O(|L|Tk²) time.

Updates for W. How W is updated with F and X fixed depends on the choice of R_w(W). Many parameter estimation techniques have been developed for AR models with various regularizers [11, 20]. For simplicity, we consider the squared Frobenius norm: R_w(W) = ‖W‖²_F. As a result, each row w̄_r of W can be updated by solving the following one-dimensional autoregressive problem:

  argmin_{w̄}  λ_x T_AR(x̄_r | L, w̄, η) + λ_w ‖w̄‖²  ≡  argmin_{w̄}  (1/2) Σ_{t=m}^{T} (x_t − Σ_{l∈L} w_l x_{t−l})² + (λ_w/λ_x) ‖w̄‖²,

which is a simple |L|-dimensional ridge regression problem with T − m + 1 instances; it can be solved efficiently by Cholesky factorization in O(|L|³ + T|L|²) time.

Note that since our method is highly modular, one can resort to any method to solve the optimization subproblems that arise in each module. Moreover, as mentioned in Section 3, TRMF can also be used with different regularization structures, making it highly adaptable.

4.1 Connections to Existing MF Approaches
TRMF-AR is a generalization of many existing MF approaches for data with temporal dependencies. Specifically, Temporal Collaborative Filtering [23] corresponds to W^(1) = I_k on {x_t}.
The NMF method of [2] is an AR(L) model with W^(l) = α^{l−1}(1 − α) I_k for all l, where α is pre-defined. The AR(1) model of [16, 26] has W^(1) = I_n on {F x_t}. Finally, the DLM [7] is a latent AR(1) model with a general W^(1), which can be estimated by EM algorithms.

4.2 Connections to Learning Gaussian Markov Random Fields
The Gaussian Markov random field (GMRF) is a general way to model multivariate data with dependencies. A GMRF assumes that the data are generated from a multivariate Gaussian distribution with a covariance matrix Σ which describes the dependencies among the T-dimensional variables, i.e., x̄ ∼ N(0, Σ). If the unknown x̄ is assumed to be generated from this model, the negative log-likelihood of the data can be written as x̄^T Σ⁻¹ x̄, ignoring constants, where Σ⁻¹ is the inverse covariance matrix of the Gaussian distribution. This prior can be incorporated into an empirical risk minimization framework as a regularizer. Furthermore, it is known that if (Σ⁻¹)_st = 0, then x_t and x_s are conditionally independent given the other variables. In Theorem 1 we established connections to graph-based regularizers, and such methods can be seen as regularizing with the inverse covariance matrix of a Gaussian [27]. We thus have the following result:

Corollary 1. For any lag set L, w̄, and η > 0, the inverse covariance matrix Σ⁻¹_AR of the GMRF model corresponding to the quadratic regularizer R_x(x̄) := T_AR(x̄ | L, w̄, η) shares the same off-diagonal non-zero pattern as G^AR defined in Theorem 1. Moreover, we have T_AR(x̄ | L, w̄, η) = x̄^T Σ⁻¹_AR x̄.

A detailed proof is in Appendix C.2. As a result, our proposed AR-based regularizer is equivalent to imposing a Gaussian prior on x̄ with a structured inverse covariance described by the matrix G^AR defined in Theorem 1. Moreover, the step that learns W has a natural interpretation: the lag set L imposes the non-zero pattern of the graphical model on the data, and we then solve a simple least squares problem to learn the weights corresponding to the edges.

Table 2: Forecasting results: ND/NRMSE for each approach. Lower values are better. "-" indicates unavailability due to scalability or an inability to handle missing values.

Forecasting with full observation:
               TRMF-AR       SVD-AR(1)     TCF           AR(1)         DLM           R-DLM         Mean
  synthetic    0.373/0.487   0.444/0.872   1.000/1.424   0.928/1.401   0.936/1.391   0.996/1.420   1.000/1.424
  electricity  0.255/1.397   0.257/1.865   0.349/1.838   0.219/1.439   0.435/2.753   -/-           1.410/4.528
  traffic      0.187/0.423   0.555/1.194   0.624/0.931   0.275/0.536   0.639/0.951   -/-           0.560/0.826

Forecasting with missing values:
               TRMF-AR       SVD-AR(1)     TCF           AR(1)         DLM           R-DLM         Mean
  walmart-1    0.533/1.958   -/-           0.540/2.231   -/-           0.602/2.293   -/-           1.239/3.103
  walmart-2    0.432/1.065   -/-           0.446/1.124   -/-           0.453/1.110   -/-           1.097/2.088
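The least-squares step just described (the W-update from Section 4) can be sketched as a small ridge regression per row of W; the ridge parameter mu below is illustrative (under the weighting used above it corresponds to 2λ_w/λ_x):

```python
import numpy as np

def update_wbar(xbar, lags, mu):
    """Ridge-regression update for one row of W: regress x_t on its lagged
    values {x_{t-l} : l in L} for t = m..T, with ridge parameter mu.
    Solves (A^T A + mu I) w = A^T b, i.e. an |L|-dimensional problem,
    as in the O(|L|^3 + T|L|^2) Cholesky-based update."""
    T = len(xbar)
    m = max(lags)
    A = np.stack([[xbar[t - l] for l in lags] for t in range(m, T)])
    b = xbar[m:]
    return np.linalg.solve(A.T @ A + mu * np.eye(len(lags)), A.T @ b)
```

On a latent series that exactly follows an AR recursion, the unregularized update (mu = 0) recovers the generating coefficients.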
As an application of Theorem 1 from [15] and Corollary 1, when R_f(F) = ‖F‖²_F, we can relate T_AR to a weighted nuclear norm:

  ‖ZB‖_* = (1/2) inf_{F,X: Z=FX} { ‖F‖²_F + Σ_r T_AR(x̄_r | L, w̄, η) },   (13)

where B = U S^{1/2} and Σ⁻¹_AR = U S U^T is the eigendecomposition of Σ⁻¹_AR. Equation (13) enables us to apply the results of [15] to obtain guarantees for the AR temporal regularizer when W is given. For simplicity, we assume w̄_r = w̄ for all r and consider the following relaxed convex formulation of (12):

  Ẑ = argmin_{Z∈C}  (1/N) Σ_{(i,j)∈Ω} (Y_ij − Z_ij)² + λ_z ‖ZB‖_*,   (14)

where N = |Ω|, and C is a set of matrices with low spikiness. Full details are provided in Appendix C.3. As an application of Theorem 2 from [15], we have the following corollary.

Corollary 2. Let Z* = FX be the ground-truth n × T time series matrix of rank k. Let Y be the matrix with N = |Ω| randomly observed entries corrupted by additive Gaussian noise of variance σ². Then if λ_z ≥ C₁ σ √((n+T) log(n+T)/N), with high probability the Ẑ obtained from (14) satisfies

  ‖Z* − Ẑ‖²_F ≤ C₂ α² max(1, σ²) k(n+T) log(n+T)/N + O(α²/N),

where C₁, C₂ are positive constants, and α depends on the product Z*B.

See Appendix C.3 for details. From the results in Table 3, we observe the superior performance of TRMF-AR over standard MF, indicating that the w̄ learned by our data-driven approach (12) does aid in recovering missing entries of time series. We point out that establishing a theoretical guarantee for TRMF when W is unknown remains a challenging research direction.

5 Experimental Results
We consider five datasets (Table 1). For synthetic, we first randomly generate F ∈ R^{16×4} and generate {x_t} following an AR process with L = {1, 8}. Then Y is obtained by y_t = F x_t + ε_t, where ε_t ∼ N(0, 0.1 I).
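The synthetic setup just described can be sketched as follows. The AR coefficients, the latent noise scale, and the initialization below are arbitrary illustrative choices, since the paper does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, T, lags = 16, 4, 128, [1, 8]
# Hypothetical diagonal AR weights per latent dimension (not from the paper).
W = {1: 0.6 * np.ones(k), 8: 0.3 * np.ones(k)}

F = rng.standard_normal((n, k))                 # F in R^{16 x 4}
X = np.zeros((k, T))
X[:, :max(lags)] = rng.standard_normal((k, max(lags)))  # warm-up points
for t in range(max(lags), T):
    # x_t = sum_l W^(l) x_{t-l} + latent noise (scale is an assumption)
    X[:, t] = sum(W[l] * X[:, t - l] for l in lags) + 0.1 * rng.standard_normal(k)
# y_t = F x_t + eps_t with eps_t ~ N(0, 0.1 I)
Y = F @ X + np.sqrt(0.1) * rng.standard_normal((n, T))
```

This yields the 16 × 128 matrix Y matching the synthetic row of Table 1.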
The data sets electricity and traffic are obtained from the UCI repository, while walmart-1 and walmart-2 are two proprietary datasets from Walmart E-commerce containing weekly sale information. Due to reasons such as out-of-stock events, 55.3% and 49.3% of their entries are missing, respectively. To evaluate prediction performance, we consider the normalized deviation (ND) and the normalized RMSE (NRMSE). See Appendix A for the description of each dataset and the formal definition of each criterion.

Figure 4: Scalability: T = 512, n ∈ {500, 1000, ..., 50000}. AR({1, ..., 8}) cannot finish in 1 day.

Table 3: Missing value imputation results: ND/NRMSE for each approach. Note that TRMF outperforms all competing methods in almost all cases.

               |Ω|/(n×T)   TRMF-AR       TCF           MF            DLM           Mean
  synthetic    20%         0.467/0.661   0.713/1.030   0.688/1.064   0.933/1.382   1.002/1.474
               30%         0.336/0.455   0.629/0.961   0.595/0.926   0.913/1.324   1.004/1.445
               40%         0.231/0.306   0.495/0.771   0.374/0.548   0.834/1.259   1.002/1.479
               50%         0.201/0.270   0.289/0.464   0.317/0.477   0.772/1.186   1.001/1.498
  electricity  20%         0.245/2.395   0.255/2.427   0.362/2.903   0.462/4.777   1.333/6.031
               30%         0.235/2.415   0.245/2.436   0.355/2.766   0.410/6.605   1.320/6.050
               40%         0.231/2.429   0.242/2.457   0.348/2.697   0.196/2.151   1.322/6.030
               50%         0.223/2.434   0.233/2.459   0.319/2.623   0.158/1.590   1.320/6.109
  traffic      20%         0.190/0.427   0.208/0.448   0.310/0.604   0.353/0.603   0.578/0.857
               30%         0.186/0.419   0.199/0.432   0.299/0.581   0.286/0.518   0.578/0.856
               40%         0.185/0.416   0.198/0.428   0.292/0.568   0.251/0.476   0.578/0.857
               50%         0.184/0.415   0.193/0.422   0.251/0.510   0.224/0.447   0.578/0.857

Methods/Implementations Compared:
• TRMF-AR: The proposed formulation (12) with $R_w(W) = \|W\|_F^2$. For L, we use {1, 2, ...
, 8} for synthetic, {1, ..., 24} ∪ {7×24, ..., 8×24 − 1} for electricity and traffic, and {1, ..., 10} ∪ {50, ..., 56} for walmart-1 and walmart-2 to capture seasonality.
• SVD-AR(1): A rank-k approximation Y ≈ USV⊤ is first obtained by SVD. After setting F = US and X = V⊤, a k-dimensional AR(1) model is learned on X for forecasting.
• TCF: Matrix factorization with the simple temporal regularizer proposed in [23].
• AR(1): The n-dimensional AR(1) model.²
• DLM: Two implementations: the widely used R-DLM package [12] and the code provided in [8].
• Mean: The baseline, which predicts everything to be the mean of the observed portion of Y.

For each method and data set, we perform a grid search over various parameters (such as k and the regularization values λ) following the rolling validation approach described in [11].

Scalability: Figure 4 shows that traditional time-series approaches such as AR or DLM suffer from scalability issues for large n, while TRMF-AR scales much better with n. Specifically, for n = 50,000, TRMF is two orders of magnitude faster than competing AR/DLM methods. Note that results for R-DLM are not available because the R package cannot scale beyond n in the tens (see Appendix D for more details). Furthermore, the dlmMLE routine in R-DLM uses a general optimization solver, which is orders of magnitude slower than the implementation provided in [8].

5.1 Forecasting

Forecasting with Full Observations. We first compare the various methods on the task of forecasting values in the test set, given fully observed training data. For synthetic, we consider the one-point-ahead forecasting task and use the last ten time points as the test periods. For electricity and traffic, we consider the 24-hour-ahead forecasting task and use the last seven days as the test periods.
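The ND and NRMSE criteria used throughout these comparisons can be implemented as follows (one common form of these metrics; the paper's formal definitions are in Appendix A and may differ in detail, so treat this as an assumption-labeled sketch):

```python
import numpy as np

def nd(Y_true, Y_pred, mask=None):
    """Normalized deviation: sum of absolute errors over the sum of absolute
    true values, restricted to observed entries when a boolean mask is given."""
    if mask is None:
        mask = np.ones_like(Y_true, dtype=bool)
    return np.abs(Y_true[mask] - Y_pred[mask]).sum() / np.abs(Y_true[mask]).sum()

def nrmse(Y_true, Y_pred, mask=None):
    """Normalized RMSE: RMSE over observed entries divided by their mean magnitude."""
    if mask is None:
        mask = np.ones_like(Y_true, dtype=bool)
    err = Y_true[mask] - Y_pred[mask]
    return np.sqrt(np.mean(err ** 2)) / np.mean(np.abs(Y_true[mask]))

Y = np.array([[1.0, 2.0], [3.0, 4.0]])
P = np.array([[1.0, 2.0], [3.0, 5.0]])
print(nd(Y, P))      # 1/10 = 0.1
print(nrmse(Y, P))   # sqrt(1/4) / 2.5 = 0.2
```

Both metrics normalize by the scale of the observed data, which makes scores comparable across datasets whose magnitudes differ by orders of magnitude.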
From Table 2, we can see that TRMF-AR outperforms all the other methods on both metrics considered.

Forecasting with Missing Values. We next compare the methods on the task of forecasting in the presence of missing values in the data. We use the Walmart datasets here, consider 6-week-ahead forecasting, and use the last 54 weeks as the test periods. Note that SVD-AR(1) and AR(1) cannot handle missing values. The second part of Table 2 shows that we again outperform the other methods.

5.2 Missing Value Imputation

We next consider the case of imputing missing values in the data. As in [9], we assume that blocks of data are missing over a length of time, corresponding for example to sensor malfunctions. To create data with missing entries, we first fixed the percentage of data that we were interested in observing, and then uniformly at random occluded blocks of a predetermined length (2 for the synthetic data and 5 for the real datasets). The goal was to predict the occluded values. Table 3 shows that TRMF outperforms the compared methods in almost all cases.

6 Conclusions

We propose a novel temporal regularized matrix factorization (TRMF) framework for high-dimensional time series problems with missing values. TRMF not only models temporal dependency among the data points, but also supports data-driven dependency learning. TRMF generalizes several well-known methods and yields superior performance when compared to other state-of-the-art methods on real-world datasets.

Acknowledgements: This research was supported by NSF grants (CCF-1320746, IIS-1546459, and CCF-1564000) and gifts from Walmart Labs and Adobe. We thank Abhay Jha for help with the Walmart experiments.

² In Appendix A, we also show a baseline which applies an independent AR model to each dimension.

References

[1] O. Anava, E. Hazan, and A. Zeevi. Online time series prediction with missing data.
In Proceedings of the International Conference on Machine Learning, pages 2191–2199, 2015.

[2] Z. Chen and A. Cichocki. Nonnegative matrix factorization with temporal smoothness and/or spatial decorrelation constraints. Laboratory for Advanced Brain Signal Processing, RIKEN, Tech. Rep. 68, 2005.

[3] Z. Ghahramani and G. E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto, Dept. of Computer Science, 1996.

[4] R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., 2nd edition, 1994.

[5] F. Han and H. Liu. Transition matrix estimation in high dimensional time series. In Proceedings of the International Conference on Machine Learning, pages 172–180, 2013.

[6] P. Jain and I. S. Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.

[7] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35–45, 1960.

[8] L. Li and B. A. Prakash. Time series clustering: Complex is simpler! In Proceedings of the International Conference on Machine Learning, pages 185–192, 2011.

[9] L. Li, J. McCann, N. S. Pollard, and C. Faloutsos. DynaMMo: Mining and summarization of coevolving sequences with missing values. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 507–516. ACM, 2009.

[10] I. Melnyk and A. Banerjee. Estimating structured vector autoregressive model. In Proceedings of the Thirty-Third International Conference on Machine Learning (ICML), 2016.

[11] W. B. Nicholson, D. S. Matteson, and J. Bien. Structured regularization for large vector autoregressions. Technical report, Cornell University, 2014.

[12] G. Petris. An R package for dynamic linear models.
Journal of Statistical Software, 36(12):1–16, 2010.

[13] G. Petris, S. Petrone, and P. Campagnoli. Dynamic Linear Models with R. Use R! Springer, 2009.

[14] S. Rallapalli, L. Qiu, Y. Zhang, and Y.-C. Chen. Exploiting temporal stability and low-rank structure for localization in mobile networks. In International Conference on Mobile Computing and Networking, MobiCom '10, pages 161–172. ACM, 2010.

[15] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems, 2015.

[16] M. Roughan, Y. Zhang, W. Willinger, and L. Qiu. Spatio-temporal compressive sensing and internet traffic matrices (extended version). IEEE/ACM Transactions on Networking, 20(3):662–676, June 2012.

[17] R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253–264, 1982.

[18] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pages 144–158. Springer, 2003.

[19] J. Z. Sun, K. R. Varshney, and K. Subbian. Dynamic matrix factorization: A state space approach. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 1897–1900. IEEE, 2012.

[20] H. Wang, G. Li, and C.-L. Tsai. Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(1):63–78, 2007.

[21] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer, 2013.

[22] K. Wilson, B. Raj, and P. Smaragdis. Regularized non-negative matrix factorization with temporal dependencies for speech denoising. In Interspeech, pages 411–414, 2008.

[23] L. Xiong, X. Chen, T.-K. Huang, J. G.
Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In SIAM International Conference on Data Mining, pages 223–234, 2010.

[24] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems 26, pages 2301–2309, 2013.

[25] H.-F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the International Conference on Machine Learning, pages 593–601, 2014.

[26] Y. Zhang, M. Roughan, W. Willinger, and L. Qiu. Spatio-temporal compressive sensing and internet traffic matrices. SIGCOMM Comput. Commun. Rev., 39(4):267–278, Aug. 2009.

[27] T. Zhou, H. Shan, A. Banerjee, and G. Sapiro. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In SDM, volume 12, pages 403–414. SIAM, 2012.