{"title": "Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3491, "page_last": 3499, "abstract": "Accurate and efficient analysis of multivariate spatio-temporal data is critical in climatology, geology, and sociology applications. Existing models usually assume simple inter-dependence among variables, space, and time, and are computationally expensive. We propose a unified low rank tensor learning framework for multivariate spatio-temporal analysis, which can conveniently incorporate different properties in spatio-temporal data, such as spatial clustering and shared structure among variables. We demonstrate how the general framework can be applied to cokriging and forecasting tasks, and develop an efficient greedy algorithm to solve the resulting optimization problem with convergence guarantee. We conduct experiments on both synthetic datasets and real application datasets to demonstrate that our method is not only significantly faster than existing methods but also achieves lower estimation error.", "full_text": "Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning\n\nMohammad Taha Bahadori*\nDept. of Electrical Engineering\nUniv. of Southern California\nLos Angeles, CA 90089\nmohammab@usc.edu\n\nRose Yu*\nDept. of Computer Science\nUniv. of Southern California\nLos Angeles, CA 90089\nqiyu@usc.edu\n\nYan Liu\nDept. of Computer Science\nUniv. of Southern California\nLos Angeles, CA 90089\nyanliu.cs@usc.edu\n\nAbstract\n\nAccurate and efficient analysis of multivariate spatio-temporal data is critical in climatology, geology, and sociology applications. Existing models usually assume simple inter-dependence among variables, space, and time, and are computationally expensive. 
We propose a unified low rank tensor learning framework for multivariate spatio-temporal analysis, which can conveniently incorporate different properties in spatio-temporal data, such as spatial clustering and shared structure among variables. We demonstrate how the general framework can be applied to cokriging and forecasting tasks, and develop an efficient greedy algorithm to solve the resulting optimization problem with convergence guarantee. We conduct experiments on both synthetic datasets and real application datasets to demonstrate that our method is not only significantly faster than existing methods but also achieves lower estimation error.\n\n1 Introduction\n\nSpatio-temporal data provide unique information regarding \u201cwhere\u201d and \u201cwhen\u201d, which is essential to answer many important questions in scientific studies from geology, climatology to sociology. In the context of big data, we are confronted with a series of new challenges when analyzing spatio-temporal data because of the complex spatial and temporal dependencies involved.\n\nA plethora of excellent work has been conducted to address the challenge and achieved successes to a certain extent [9, 14]. Oftentimes, geostatistical models use cross variogram and cross covariance functions to describe the intrinsic dependency structure. However, the parametric form of cross variogram and cross covariance functions imposes strong assumptions on the spatial and temporal correlation, which requires domain knowledge and manual work. Furthermore, parameter learning of those statistical models is computationally expensive, making them infeasible for large-scale applications.\n\nCokriging and forecasting are two central tasks in multivariate spatio-temporal analysis. Cokriging utilizes the spatial correlations to predict the value of the variables for new locations. 
One widely adopted method is multitask Gaussian process (MTGP) [5], which assumes a Gaussian process prior over latent functions to directly induce correlations between tasks. However, for a cokriging task with M variables of P locations for T time stamps, the time complexity of MTGP is $O(M^3 P^3 T)$ [5]. For forecasting, popular methods in multivariate time series analysis include vector autoregressive (VAR) models, autoregressive integrated moving average (ARIMA) models, and cointegration models. An alternative method for spatio-temporal analysis is Bayesian hierarchical spatio-temporal models with either separable or non-separable space-time covariance functions [7]. Rank reduced models have been proposed to capture the inter-dependency among variables [1]. However, very few models can directly handle the correlations among variables, space and time simultaneously in a scalable way. In this paper, we aim to address this problem by presenting a unified framework for many spatio-temporal analysis tasks that is scalable for large-scale applications.\n\n*Authors have equal contributions.\n\nTensor representation provides a convenient way to capture inter-dependencies along multiple dimensions. Therefore it is natural to represent the multivariate spatio-temporal data as a tensor. Recent advances in low rank learning have led to simple models that can capture the commonalities among each mode of the tensor [16, 21]. Similar arguments can be found in the literature of spatial data recovery [12], neuroimaging analysis [27], and multi-task learning [21]. Our work builds upon recent advances in low rank tensor learning [16, 12, 27] and further considers the scenario where additional side information of data is available. 
For example, in geo-spatial applications, apart from measurements of multiple variables, geographical information is available to infer location adjacency; in social network applications, friendship network structure is collected to obtain preference similarity. To utilize the side information, we can construct a Laplacian regularizer from the similarity matrices, which favors locally smooth solutions.\n\nWe develop a fast greedy algorithm for learning low rank tensors based on the greedy structure learning framework [3, 25, 22]. Greedy low rank tensor learning is efficient, as it does not require full singular value decomposition of large matrices as opposed to other alternating direction methods [12]. We also provide a bound on the difference between the loss function at our greedy solution and the one at the globally optimal solution. Finally, we present experiment results on simulation datasets as well as application datasets in climate and social network analysis, which show that our algorithm is faster and achieves higher prediction accuracy than state-of-art approaches in cokriging and forecasting tasks.\n\n2 Tensor formulation for multivariate spatio-temporal analysis\n\nThe critical element in multivariate spatio-temporal analysis is an efficient way to incorporate the spatial temporal correlations into modeling and automatically capture the shared structures across variables, locations, and time. In this section, we present a unified low rank tensor learning framework that can perform various types of spatio-temporal analysis. We will use two important applications, i.e., cokriging and forecasting, to motivate and describe the framework.\n\n2.1 Cokriging\n\nIn geostatistics, cokriging is the task of interpolating the data of one variable for unknown locations by taking advantage of the observations of variables from known locations. 
For example, by making use of the correlations between precipitation and temperature, we can obtain a more precise estimate of temperature in unknown locations than univariate kriging. Formally, denote the complete data for P locations over T time stamps with M variables as $\\mathcal{X} \\in \\mathbb{R}^{P \\times T \\times M}$. We only observe the measurements for a subset of locations $\\Omega \\subset \\{1, \\dots, P\\}$ and their side information such as longitude and latitude. Given the measurements $\\mathcal{X}_\\Omega$ and the side information, the goal is to estimate a tensor $\\mathcal{W} \\in \\mathbb{R}^{P \\times T \\times M}$ that satisfies $\\mathcal{W}_\\Omega = \\mathcal{X}_\\Omega$. Here $\\mathcal{X}_\\Omega$ represents the outcome of applying the index operator $I_\\Omega$ to $\\mathcal{X}_{:,:,m}$ for all variables $m = 1, \\dots, M$. The index operator $I_\\Omega$ is a diagonal matrix whose entries are one for the locations included in $\\Omega$ and zero otherwise.\n\nTwo key consistency principles have been identified for effective cokriging [10, Chapter 6.2]: (1) Global consistency: the data on the same structure are likely to be similar. (2) Local consistency: the data in close locations are likely to be similar. The former principle is akin to the cluster assumption in semi-supervised learning [26]. We incorporate these principles in a concise and computationally efficient low-rank tensor learning framework.\n\nTo achieve global consistency, we constrain the tensor $\\mathcal{W}$ to be low rank. The low rank assumption is based on the belief that high correlations exist within variables, locations and time, which leads to natural clustering of the data. Existing literature has explored the low rank structure among these three dimensions separately, e.g., multi-task learning [20] for variable correlation, fixed rank kriging [8] for spatial correlations. Low rankness assumes that the observed data can be described with a few latent factors. 
It enforces the commonalities along three dimensions without an explicit form for the shared structures in each dimension.\n\nFor local consistency, we construct a regularizer via the spatial Laplacian matrix. The Laplacian matrix is defined as $L = D - A$, where $A$ is a kernel matrix constructed by pairwise similarity and $D$ is the diagonal matrix with $D_{i,i} = \\sum_j A_{i,j}$. Similar ideas have been explored in matrix completion [17]. In cokriging literature, the local consistency is enforced via the spatial covariance matrix. The Bayesian models often impose the Gaussian process prior on the observations with the covariance matrix $K = K_v \\otimes K_x$, where $K_v$ is the covariance between variables and $K_x$ is that for locations. The Laplacian regularization term corresponds to the relational Gaussian process [6] where the covariance matrix is approximated by the spatial Laplacian.\n\nIn summary, we can perform cokriging and find the value of tensor $\\mathcal{W}$ by solving the following optimization problem:\n\n$$\\widehat{\\mathcal{W}} = \\operatorname*{argmin}_{\\mathcal{W}} \\Big\\{ \\|\\mathcal{W}_\\Omega - \\mathcal{X}_\\Omega\\|_F^2 + \\mu \\sum_{m=1}^{M} \\operatorname{tr}\\big(\\mathcal{W}_{:,:,m}^\\top L \\mathcal{W}_{:,:,m}\\big) \\Big\\} \\quad \\text{s.t.} \\quad \\operatorname{rank}(\\mathcal{W}) \\le \\rho, \\qquad (1)$$\n\nwhere the Frobenius norm of a tensor $\\mathcal{A}$ is defined as $\\|\\mathcal{A}\\|_F = \\sqrt{\\sum_{i,j,k} \\mathcal{A}_{i,j,k}^2}$ and $\\mu, \\rho > 0$ are the parameters that make tradeoff between the local and global consistency, respectively. The low rank constraint finds the principal components of the tensor and reduces the complexity of the model while the Laplacian regularizer clusters the data using the relational information among the locations. By learning the right tradeoff between these two techniques, our method is able to benefit from both. Due to the various definitions of tensor rank, we use rank here as a generic notion of rank complexity, which will be specified in a later section.\n\n2.2 Forecasting\n\nForecasting estimates the future value of multivariate time series given historical observations. For ease of presentation, we use the classical VAR model with K lags and coefficient tensor $\\mathcal{W} \\in \\mathbb{R}^{P \\times KP \\times M}$ as an example. Using the matrix representation, the VAR(K) process defines the following data generation process:\n\n$$\\mathcal{X}_{:,t,m} = \\mathcal{W}_{:,:,m} \\mathbf{X}_{t,m} + \\mathcal{E}_{:,t,m}, \\quad \\text{for } m = 1, \\dots, M \\text{ and } t = K + 1, \\dots, T, \\qquad (2)$$\n\nwhere $\\mathbf{X}_{t,m} = [\\mathcal{X}_{:,t-1,m}^\\top, \\dots, \\mathcal{X}_{:,t-K,m}^\\top]^\\top$ denotes the concatenation of K-lag historical data before time t. The noise tensor $\\mathcal{E}$ is a multivariate Gaussian with zero mean and unit variance.\n\nExisting multivariate regression methods designed to capture the complex correlations, such as Tucker decomposition [21], are computationally expensive. A scalable solution requires a simpler model that also efficiently accounts for the shared structures in variables, space, and time. Similar global and local consistency principles still hold in forecasting. For global consistency, we can use low rank constraint to capture the commonalities of the variables as well as the spatial correlations on the model parameter tensor, as in [9]. For local consistency, we enforce the predicted value for close locations to be similar via spatial Laplacian regularization. Thus, we can formulate the forecasting task as the following optimization problem over the model coefficient tensor $\\mathcal{W}$:\n\n$$\\widehat{\\mathcal{W}} = \\operatorname*{argmin}_{\\mathcal{W}} \\Big\\{ \\|\\widehat{\\mathcal{X}} - \\mathcal{X}\\|_F^2 + \\mu \\sum_{m=1}^{M} \\operatorname{tr}\\big(\\widehat{\\mathcal{X}}_{:,:,m}^\\top L \\widehat{\\mathcal{X}}_{:,:,m}\\big) \\Big\\} \\quad \\text{s.t.} \\quad \\operatorname{rank}(\\mathcal{W}) \\le \\rho, \\quad \\widehat{\\mathcal{X}}_{:,t,m} = \\mathcal{W}_{:,:,m} \\mathbf{X}_{t,m}. \\qquad (3)$$\n\nThough cokriging and forecasting are two different tasks, we can easily see that both formulations follow the global and local consistency principles and can capture the inter-correlations from spatial-temporal data.\n\n2.3 Unified Framework\n\nWe now show that both cokriging and forecasting can be formulated into the same tensor learning framework. Let us rewrite the loss function in Eq. (1) and Eq. 
(3) in the form of multitask regression and complete the quadratic form for the loss function. The cokriging task can be reformulated as follows:\n\n$$\\widehat{\\mathcal{W}} = \\operatorname*{argmin}_{\\mathcal{W}} \\Big\\{ \\sum_{m=1}^{M} \\|\\mathcal{W}_{:,:,m} H - (H^\\top)^{-1} \\mathcal{X}_{\\Omega,m}\\|_F^2 \\Big\\} \\quad \\text{s.t.} \\quad \\operatorname{rank}(\\mathcal{W}) \\le \\rho, \\qquad (4)$$\n\nwhere we define $HH^\\top = I_\\Omega + \\mu L$.\u00b9 For the forecasting problem, $HH^\\top = I_P + \\mu L$ and we have:\n\n$$\\widehat{\\mathcal{W}} = \\operatorname*{argmin}_{\\mathcal{W}} \\Big\\{ \\sum_{m=1}^{M} \\sum_{t=K+1}^{T} \\|H \\mathcal{W}_{:,:,m} \\mathbf{X}_{t,m} - (H^{-1}) \\mathcal{X}_{:,t,m}\\|_F^2 \\Big\\} \\quad \\text{s.t.} \\quad \\operatorname{rank}(\\mathcal{W}) \\le \\rho. \\qquad (5)$$\n\nBy slight change of notation (cf. Appendix D), we can easily see that the optimal solution of both problems can be obtained by the following optimization problem with appropriate choice of tensors $\\mathcal{Y}$ and $\\mathcal{V}$:\n\n$$\\widehat{\\mathcal{W}} = \\operatorname*{argmin}_{\\mathcal{W}} \\Big\\{ \\sum_{m=1}^{M} \\|\\mathcal{W}_{:,:,m} \\mathcal{Y}_{:,:,m} - \\mathcal{V}_{:,:,m}\\|_F^2 \\Big\\} \\quad \\text{s.t.} \\quad \\operatorname{rank}(\\mathcal{W}) \\le \\rho. \\qquad (6)$$\n\nAfter unifying the objective function, we note that tensor rank has different notions such as CP rank, Tucker rank and mode n-rank [16, 12]. In this paper, we choose the mode-n rank, which is computationally more tractable [12, 24]. The mode-n rank of a tensor $\\mathcal{W}$ is the rank of its mode-n unfolding $W_{(n)}$.\u00b2 In particular, for a tensor $\\mathcal{W}$ with N modes, we have the following definition:\n\n$$\\text{mode-}n\\ \\operatorname{rank}(\\mathcal{W}) = \\sum_{n=1}^{N} \\operatorname{rank}(W_{(n)}). \\qquad (7)$$\n\nA common practice to solve this formulation with mode n-rank constraint is to relax the rank constraint to a convex nuclear norm constraint [12, 24]. However, those methods are computationally expensive since they need full singular value decomposition of large matrices. In the next section, we present a fast greedy algorithm to tackle the problem.\n\n3 Fast greedy low rank tensor learning\n\nTo solve the non-convex problem in Eq. (6) and find its optimal solution, we propose a greedy learning algorithm by successively adding rank-1 estimation of the mode-n unfolding. The main idea of the algorithm is to unfold the tensor into a matrix, seek its rank-1 approximation and then fold back into a tensor with the same dimensionality. 
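As a hedged illustration of the unfold and fold-back operations just described, the NumPy sketch below implements mode-n unfolding with the cyclic mode ordering given in the paper's footnote on unfolding, its inverse, and the sum of unfolding ranks from Eq. (7). The function names `unfold`, `refold`, and `sum_of_mode_ranks` are hypothetical, modes are 0-indexed here, and the row-major reshape order is an assumption (any fixed order works as long as `refold` mirrors it).

```python
import numpy as np


def unfold(W, n):
    """Mode-n unfolding: mode n becomes the rows, the remaining modes are
    concatenated cyclically (n+1, ..., N-1, 0, ..., n-1) into the columns."""
    axes = [(n + i) % W.ndim for i in range(W.ndim)]
    return np.transpose(W, axes).reshape(W.shape[n], -1)


def refold(M, n, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from its
    mode-n unfolding M."""
    N = len(shape)
    axes = [(n + i) % N for i in range(N)]
    rolled_shape = [shape[a] for a in axes]
    # argsort(axes) is the permutation that undoes the cyclic transpose.
    return np.transpose(M.reshape(rolled_shape), np.argsort(axes))


def sum_of_mode_ranks(W):
    """Sum over all modes of the rank of the mode-n unfolding, as in Eq. (7)."""
    return sum(int(np.linalg.matrix_rank(unfold(W, n))) for n in range(W.ndim))
```

For a tensor built as the outer product of three vectors, every unfolding has rank one, so `sum_of_mode_ranks` returns 3, and `refold(unfold(W, n), n, W.shape)` reproduces `W` for every mode.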
We describe this algorithm in three steps: (i) First, we show that we can learn rank-1 matrix estimations efficiently by solving a generalized eigenvalue problem, (ii) We use the rank-1 matrix estimation to greedily solve the original tensor rank constrained problem, and (iii) We propose an enhancement via orthogonal projections after each greedy step.\n\nOptimal rank-1 Matrix Learning The following lemma enables us to find such optimal rank-1 estimation of the matrices.\n\nLemma 1. Consider the following rank constrained problem:\n\n$$\\widehat{A}_1 = \\operatorname*{argmin}_{A:\\, \\operatorname{rank}(A) = 1} \\big\\{ \\|Y - AX\\|_F^2 \\big\\}, \\qquad (8)$$\n\nwhere $Y \\in \\mathbb{R}^{q \\times n}$, $X \\in \\mathbb{R}^{p \\times n}$, and $A \\in \\mathbb{R}^{q \\times p}$. The optimal solution $\\widehat{A}_1$ can be written as $\\widehat{A}_1 = \\widehat{u} \\widehat{v}^\\top$, $\\|\\widehat{v}\\|_2 = 1$, where $\\widehat{v}$ is the dominant eigenvector of the following generalized eigenvalue problem:\n\n$$(X Y^\\top Y X^\\top) v = \\lambda (X X^\\top) v \\qquad (9)$$\n\nand $\\widehat{u}$ can be computed as\n\n$$\\widehat{u} = \\frac{1}{\\widehat{v}^\\top X X^\\top \\widehat{v}} Y X^\\top \\widehat{v}. \\qquad (10)$$\n\nThe lemma can be found in e.g. [2] and we also provide a proof in Appendix A. Eq. (9) is a generalized eigenvalue problem whose dominant eigenvector can be found efficiently [13]. If $XX^\\top$ is full rank, as assumed in Theorem 2, the problem is simplified to a regular eigenvalue problem whose dominant eigenvector can be efficiently computed.\n\n\u00b9 We can use Cholesky decomposition to obtain $H$. In the rare cases that $I_\\Omega + \\mu L$ is not full rank, $\\epsilon I_P$ is added where $\\epsilon$ is a very small positive value.\n\n\u00b2 The mode-n unfolding of a tensor is the matrix resulting from treating n as the first mode of the matrix, and cyclically concatenating other modes. 
Tensor refolding is the reverse direction operation [16].\n\nAlgorithm 1 Greedy Low-rank Tensor Learning\n1: Input: transformed data $\\mathcal{Y}, \\mathcal{V}$ of M variables, stopping criteria $\\eta$\n2: Output: N mode tensor $\\mathcal{W}$\n3: Initialize $\\mathcal{W} \\leftarrow 0$\n4: repeat\n5: for n = 1 to N do\n6: $B_n \\leftarrow \\operatorname*{argmin}_{B:\\, \\operatorname{rank}(B) = 1} L(\\operatorname{refold}(W_{(n)} + B); \\mathcal{Y}, \\mathcal{V})$\n7: $\\Delta_n \\leftarrow L(\\mathcal{W}; \\mathcal{Y}, \\mathcal{V}) - L(\\operatorname{refold}(W_{(n)} + B_n); \\mathcal{Y}, \\mathcal{V})$\n8: end for\n9: $n^* \\leftarrow \\operatorname*{argmax}_n \\{\\Delta_n\\}$\n10: if $\\Delta_{n^*} > \\eta$ then\n11: $\\mathcal{W} \\leftarrow \\mathcal{W} + \\operatorname{refold}(B_{n^*}, n^*)$\n12: end if\n13: $\\mathcal{W} \\leftarrow \\operatorname*{argmin}_{\\operatorname{row}(A_{(1)}) \\subseteq \\operatorname{row}(W_{(1)}),\\ \\operatorname{col}(A_{(1)}) \\subseteq \\operatorname{col}(W_{(1)})} L(\\mathcal{A}; \\mathcal{Y}, \\mathcal{V})$ # Optional Orthogonal Projection Step.\n14: until $\\Delta_{n^*} < \\eta$\n\nGreedy Low n-rank Tensor Learning The optimal rank-1 matrix learning serves as a basic element in our greedy algorithm. Using Lemma 1, we can solve the problem in Eq. (6) in the Forward Greedy Selection framework as follows: at each iteration of the greedy algorithm, it searches for the mode that gives the largest decrease in the objective function. It does so by unfolding the tensor in that mode and finding the best rank-1 estimation of the unfolded tensor. After finding the optimal mode, it adds the rank-1 estimate in that mode to the current estimation of the tensor. Algorithm 1 shows the details of this approach, where $L(\\mathcal{W}; \\mathcal{Y}, \\mathcal{V}) = \\sum_{m=1}^{M} \\|\\mathcal{W}_{:,:,m} \\mathcal{Y}_{:,:,m} - \\mathcal{V}_{:,:,m}\\|_F^2$. Note that we can find the optimal rank-1 solution in only one of the modes, but it is enough to guarantee the convergence of our greedy algorithm.\n\nTheorem 2 bounds the difference between the loss function evaluated at each iteration of the greedy algorithm and the one at the globally optimal solution.\n\nTheorem 2. Suppose in Eq. (6) the matrices $\\mathcal{Y}_{:,:,m}^\\top \\mathcal{Y}_{:,:,m}$ for $m = 1, \\dots, M$ are positive definite. The solution of Algo. 1 at its kth iteration step satisfies the following inequality:\n\n$$L(\\mathcal{W}_k; \\mathcal{Y}, \\mathcal{V}) - L(\\mathcal{W}^*; \\mathcal{Y}, \\mathcal{V}) \\le \\frac{(\\|\\mathcal{Y}\\|_2 \\|W^*_{(1)}\\|_*)^2}{(k + 1)}, \\qquad (11)$$\n\nwhere $\\mathcal{W}^*$ is the global minimizer of the problem in Eq. (6) and $\\|\\mathcal{Y}\\|_2$ is the largest singular value of a block diagonal matrix created by placing the matrices $\\mathcal{Y}(:, :, m)$ on its diagonal blocks.\n\nThe detailed proof is given in Appendix B. The key idea of the proof is that the amount of decrease in the loss function by each step in the selected mode is not smaller than the amount of decrease if we had selected the first mode. The theorem shows that we can obtain the same rate of convergence for learning low rank tensors as achieved in [23] for learning low rank matrices. The greedy algorithm in Algorithm 1 is also connected to mixture regularization in [24]: the mixture approach decomposes the solution into a set of low rank structures while the greedy algorithm successively learns a set of rank one components.\n\nGreedy Algorithm with Orthogonal Projections It is well-known that the forward greedy algorithm may make steps in sub-optimal directions because of noise. A common solution to alleviate the effect of noise is to make orthogonal projections after each greedy step [3, 22]. Thus, we enhance the forward greedy algorithm by projecting the solution into the space spanned by the singular vectors of its mode-1 unfolding. The greedy algorithm with orthogonal projections performs an extra step in line 13 of Algorithm 1: It finds the top k singular vectors of the solution: $[U, S, V] \\leftarrow \\operatorname{svd}(W_{(1)}, k)$ where k is the iteration number. Then it finds the best solution in the space spanned by $U$ and $V$ by solving $\\widehat{S} \\leftarrow \\min_S L(U S V^\\top, \\mathcal{Y}, \\mathcal{V})$, which has a closed form solution. Finally, it reconstructs the solution: $\\mathcal{W} \\leftarrow \\operatorname{refold}(U \\widehat{S} V^\\top, 1)$. 
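To make the rank-1 step of Lemma 1 and the projection step above concrete, here is a minimal NumPy sketch for the plain matrix case (a single slice, loss $\|Y - AX\|_F^2$ in the notation of Lemma 1). It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, $XX^\top$ is assumed positive definite so that a Cholesky factor exists, and the closed form used for $\widehat{S}$ is the least-squares solution $U^\top Y Z^{+}$ with $Z = V^\top X$, which follows from $U$ having orthonormal columns.

```python
import numpy as np


def rank1_fit(Y, X):
    """Best rank-1 A minimizing ||Y - A X||_F^2 (the problem in Lemma 1).

    Assumes X X^T is positive definite."""
    B = X @ X.T                          # right-hand matrix of Eq. (9)
    C = np.linalg.cholesky(B)            # B = C C^T
    G = np.linalg.solve(C, X @ Y.T).T    # G = Y X^T C^{-T}
    # Substituting v = C^{-T} w turns Eq. (9) into the ordinary symmetric
    # eigenproblem (G^T G) w = lambda w.
    _, vecs = np.linalg.eigh(G.T @ G)
    w = vecs[:, -1]                      # eigh sorts eigenvalues ascending
    v = np.linalg.solve(C.T, w)
    u = (Y @ X.T @ v) / (v @ B @ v)      # Eq. (10)
    return np.outer(u, v)                # invariant to the scaling of v


def project_refit(A, Y, X, k):
    """Refit A inside the span of its top-k singular vectors."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    U, Vt = U[:, :k], Vt[:k, :]
    Z = Vt @ X                           # k x n
    S = U.T @ Y @ np.linalg.pinv(Z)      # closed-form minimizer over S
    return U @ S @ Vt


def greedy_with_projection(Y, X, steps):
    """Forward greedy rank-1 updates, each followed by a projection refit."""
    A = np.zeros((Y.shape[0], X.shape[0]))
    losses = []
    for k in range(1, steps + 1):
        A = A + rank1_fit(Y - A @ X, X)  # rank-1 fit of the current residual
        A = project_refit(A, Y, X, k)
        losses.append(np.linalg.norm(Y - A @ X) ** 2)
    return A, losses
```

On noise-free data generated from a rank-2 coefficient matrix, two greedy steps should recover the matrix up to numerical precision, since each step removes one rank-1 component of the weighted least-squares solution.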
Note that the projection only needs to find top k singular vectors which can be computed efficiently for small values of k.\n\n[Figure 1 panels: (a) RMSE, parameter estimation RMSE vs. # of samples; (b) Rank, mixture rank complexity vs. # of samples; (c) Scalability, run time (sec) vs. # of variables.]\n\nFigure 1: Tensor estimation performance comparison on the synthetic dataset over 10 random runs. (a) parameter Estimation RMSE with training time series length, (b) Mixture Rank Complexity with training time series length, (c) running time for one single round with respect to number of variables.\n\n4 Experiments\n\nWe evaluate the efficacy of our algorithms on synthetic datasets and real-world application datasets.\n\n4.1 Low rank tensor learning on synthetic data\n\nFor empirical evaluation, we compare our method with multitask learning (MTL) algorithms, which also utilize the commonalities between different prediction tasks for better performance. 
We use the following baselines: (1) Trace norm regularized MTL (Trace), which seeks the low rank structure only on the task dimension; (2) Multilinear MTL [21], which adapts the convex relaxation of low rank tensor learning solved with Alternating Direction Methods of Multiplier (ADMM) [11] and Tucker decomposition to describe the low rankness in multiple dimensions; (3) MTL-L1, MTL-L21 [20], and MTL-LDirty [15], which investigate joint sparsity of the tasks with $L_p$ norm regularization. For MTL-L1, MTL-L21 [20] and MTL-LDirty, we use MALSAR Version 1.1 [28].\n\nWe construct a model coefficient tensor $\\mathcal{W}$ of size $20 \\times 20 \\times 10$ with CP rank equal to 1. Then, we generate the observations $\\mathcal{Y}$ and $\\mathcal{V}$ according to multivariate regression model $\\mathcal{V}_{:,:,m} = \\mathcal{W}_{:,:,m} \\mathcal{Y}_{:,:,m} + \\mathcal{E}_{:,:,m}$ for $m = 1, \\dots, M$, where $\\mathcal{E}$ is a tensor with zero mean Gaussian noise elements. We split the synthesized data into training and testing time series and vary the length of the training time series from 10 to 200. For each training length setting, we repeat the experiments 10 times and select the model parameters via 5-fold cross validation. We measure the prediction performance via two criteria: parameter estimation accuracy and rank complexity. For accuracy, we calculate the RMSE of the estimation versus the true model coefficient tensor. For rank complexity, we calculate the mixture rank complexity [24] as $MRC = \\frac{1}{N} \\sum_{n=1}^{N} \\operatorname{rank}(W_{(n)})$.\n\nThe results are shown in Figure 1(a) and 1(b). We omit the Tucker decomposition as the results are not comparable. We can clearly see that the proposed greedy algorithm with orthogonal projections achieves the most accurate tensor estimation. In terms of rank complexity, we make two observations: (i) Given that the tensor CP rank is 1, greedy algorithm with orthogonal projections produces the estimate with the lowest rank complexity. This can be attributed to the fact that the orthogonal projections eliminate the redundant rank-1 components that fall in the same spanned space. (ii) The rank complexity of the forward greedy algorithm increases as we enlarge the sample size. We believe that when there is a limited number of observations, most of the new rank-1 elements added to the estimate are not accurate and the cross-validation steps prevent them from being added to the model. However, as the sample size grows, the rank-1 estimates become more accurate and they are preserved during the cross-validation.\n\nTo showcase the scalability of our algorithm, we vary the number of variables and generate a series of tensors $\\mathcal{W} \\in \\mathbb{R}^{20 \\times 20 \\times M}$ for M from 10 to 100 and record the running time (in seconds) for three tensor learning algorithms, i.e., forward greedy, greedy with orthogonal projections and ADMM. We measure the run time on a machine with a 6-core 12-thread Intel Xeon 2.67GHz processor and 12GB memory. The results are shown in Figure 1(c). The running time of ADMM increases rapidly with the data size while the greedy algorithm stays steady, which confirms the speedup advantage of the greedy algorithm.\n\nTable 1: Cokriging RMSE of 6 methods averaged over 10 runs. In each run, 10% of the locations are assumed missing.\n\nDATASET ADMM FORWARD ORTHOGONAL\nUSHCN 0.8051 0.7594 0.7210\nCCDS 0.8292 0.5555 0.4532\nYELP 0.7730 0.6993 0.6958\nFOURSQUARE 0.1373 0.1338 0.1334\n\nSIMPLE ORDINARY MTGP\nUSHCN, CCDS: 0.8760 1.0007 1.0296 0.7634 0.7803 0.7312\nYELP, FOURSQUARE: NA NA NA\n\n4.2 Spatio-temporal analysis on real world data\n\nWe conduct cokriging and forecasting experiments on four real-world datasets:\n\nUSHCN The U.S. 
Historical Climatology Network Monthly (USHCN)\u00b3 dataset consists of monthly climatological data of 108 stations spanning from year 1915 to 2000. It has three climate variables: (1) daily maximum, (2) minimum temperature averaged over month, and (3) total monthly precipitation.\n\nCCDS The Comprehensive Climate Dataset (CCDS)\u2074 is a collection of climate records of North America from [19]. The dataset was collected and pre-processed by five federal agencies. It contains monthly observations of 17 variables such as Carbon dioxide and temperature spanning from 1990 to 2001. The observations were interpolated on a $2.5 \\times 2.5$ degree grid, with 125 observation locations.\n\nYelp The Yelp dataset\u2075 contains the user rating records for 22 categories of businesses on Yelp over ten years. The processed dataset includes the rating values (1-5) binned into 500 time intervals and the corresponding social graph for 137 active users. The dataset is used for the spatio-temporal recommendation task to predict the missing user ratings across all business categories.\n\nFoursquare The Foursquare dataset [18] contains the users\u2019 check-in records in Pittsburgh area from Feb 24 to May 23, 2012, categorized by different venue types such as Art & Entertainment, College & University, and Food. The dataset records the number of check-ins by 121 users in each of the 15 categories of venues over 1200 time intervals, as well as their friendship network.\n\n4.2.1 Cokriging\n\nWe compare the cokriging performance of our proposed method with the classical cokriging approaches including simple kriging and ordinary cokriging with nonbias condition [14], which are applied to each variable separately. We further compare with multitask Gaussian process (MTGP) [5], which also considers the correlation among variables. We also adapt ADMM for solving the nuclear norm relaxed formulation of the cokriging formulation as a baseline (see Appendix C for more details). 
For USHCN and CCDS, we construct a Laplacian matrix by calculating the pairwise Haversine distance of locations. For Foursquare and Yelp, we construct the graph Laplacian from the user friendship network.\n\nFor each dataset, we first normalize it by removing the trend and dividing by the standard deviation. Then we randomly pick 10% of locations (or users for Foursquare) and eliminate the measurements of all variables over the whole time span. Then, we produce the estimates for all variables of each timestamp. We repeat the procedure 10 times and report the average prediction RMSE for all timestamps and 10 random sets of missing locations. We use the MATLAB Kriging Toolbox\u2076 for the classical cokriging algorithms and the MTGP code provided by [5].\n\nTable 1 shows the results for the cokriging task. The greedy algorithm with orthogonal projections is significantly more accurate in all three datasets. The baseline cokriging methods can only handle the two dimensional longitude and latitude information, thus are not applicable to the Foursquare and Yelp dataset with additional friendship information. 
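The Laplacian construction described above for USHCN and CCDS can be sketched as follows. The paper only states that the kernel matrix is built from pairwise similarity, so the Gaussian kernel, the bandwidth parameter `sigma_km`, the Earth-radius constant, and the function names below are all assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat, lon):
    """Pairwise great-circle distances in km for coordinates in degrees."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def spatial_laplacian(lat, lon, sigma_km):
    """L = D - A, with A a Gaussian kernel on Haversine distance
    (the kernel form is an assumption, not taken from the paper)."""
    d = haversine_km(lat, lon)
    A = np.exp(-(d ** 2) / (2.0 * sigma_km ** 2))
    D = np.diag(A.sum(axis=1))   # D_ii = sum_j A_ij
    return D - A
```

The resulting matrix is symmetric with zero row sums, so the regularizer $\operatorname{tr}(W^\top L W)$ is non-negative and penalizes differences between nearby locations more heavily than between distant ones.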
The superior performance of the greedy algorithm can be attributed to two of its properties: (1) It can obtain low rank models and achieve global consistency; (2) It usually has lower estimation bias compared to nuclear norm relaxed methods.\n\n\u00b3 http://www.ncdc.noaa.gov/oa/climate/research/ushcn\n\u2074 http://www-bcf.usc.edu/~liu32/data/NA-1990-2002-Monthly.csv\n\u2075 http://www.yelp.com/dataset_challenge\n\u2076 http://globec.whoi.edu/software/kriging/V3/english.html\n\nTable 2: Forecasting RMSE for VAR process with 3 lags, trained with 90% of the time series.\n\nDATASET TUCKER ADMM FORWARD ORTHO ORTHONL TRACE MTLl1 MTLl21 MTLdirty\nUSHCN 0.8975 0.9735\nCCDS 0.9438 1.0950 0.1504 0.1492\nFSQ\n0.9273 0.9528 0.9543\n0.8632 0.9105 0.9171\n0.1245 0.1495 0.1495\n0.9227 0.8448 0.1407\n0.9175 0.8555 0.1234\n0.9171 0.8810 0.1241\n0.9069 0.8325 0.1223\n\nTable 3: Running time (in seconds) for cokriging and forecasting.\n\nCOKRIGING\nDATASET USHCN CCDS YELP FSQ\nORTHO 93.03 16.98 78.47 91.51\nADMM 791.25 320.77 2928.37 720.40\n\nFORECASTING\nUSHCN CCDS: 21.38 75.47 235.73 45.62\nFSQ: 37.70 33.83\n\n4.2.2 Forecasting\n\nWe present the empirical evaluation on the forecasting task by comparing with multitask regression algorithms. We split the data along the temporal dimension into 90% training set and 10% testing set. We choose VAR(3) model and during the training phase, we use 5-fold cross-validation.\n\nAs shown in Table 2, the greedy algorithm with orthogonal projections again achieves the best prediction accuracy. Different from the cokriging task, forecasting does not necessarily need the correlations of locations for prediction. One might raise the question as to whether the Laplacian regularizer helps. Therefore, we report the results for our formulation without Laplacian (ORTHONL) for comparison. For efficiency, we report the running time (in seconds) in Table 3 for both tasks of cokriging and forecasting. Compared with ADMM, which is a competitive baseline also capturing the commonalities among variables, space, and time, our greedy algorithm is much faster for most datasets.\n\nAs a qualitative study, we plot the map of most predictive regions analyzed by the greedy algorithm using CCDS dataset in Fig. 2. Based on the concept of how informative the past values of the climate measurements in a specific location are in predicting future values of other time series, we define the aggregate strength of predictiveness of each region as $w(t) = \\sum_{p=1}^{P} \\sum_{m=1}^{M} |\\mathcal{W}_{p,t,m}|$. We can see that two regions are identified as the most predictive regions: (1) The southwest region, which reflects the impact of the Pacific Ocean and (2) The southeast region, which frequently experiences relative sea level rise, hurricanes, and storm surge in Gulf of Mexico. Another interesting region lies in the center of Colorado, where the Rocky Mountain valleys act as a funnel for the winds from the west, providing locally divergent wind patterns.\n\n5 Conclusion\n\nIn this paper, we study the problem of multivariate spatio-temporal data analysis with an emphasis on two tasks: cokriging and forecasting. We formulate the problem into a general low rank tensor learning framework which captures both the global consistency and the local consistency principle. We develop a fast and accurate greedy solver with theoretical guarantees for its convergence. We validate the correctness and efficiency of our proposed method on both the synthetic dataset and real-application datasets. For future work, we are interested in investigating different forms of shared structure and extending the framework to capture non-linear correlations in the data.\n\nAcknowledgment\n\nWe thank the anonymous reviewers for their helpful feedback and comments. 
The research was sponsored by NSF research grants IIS-1134990 and IIS-1254206 and the Okawa Foundation Research Award. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agency, or the U.S. Government.\n\nFigure 2: Map of most predictive regions analyzed by the greedy algorithm using 17 variables of the CCDS dataset. Red color means high predictiveness whereas blue denotes low predictiveness.\n\nReferences\n[1] T. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics, pages 327\u2013351, 1951.\n\n[2] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53\u201358, 1989.\n\n[3] A. Barron, A. Cohen, W. Dahmen, and R. DeVore. Approximation and learning by greedy algorithms. The Annals of Statistics, 2008.\n\n[4] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall Inc, 1989.\n\n[5] E. Bonilla, K. Chai, and C. Williams. Multi-task Gaussian Process Prediction. In NIPS, 2007.\n\n[6] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. In NIPS, 2006.\n\n[7] N. Cressie and H. Huang. Classes of nonseparable, spatio-temporal stationary covariance functions. JASA, 1999.\n\n[8] N. Cressie and G. Johannesson. Fixed rank kriging for very large spatial data sets. JRSS B (Statistical Methodology), 70(1):209\u2013226, 2008.\n\n[9] N. Cressie, T. Shi, and E. Kang. Fixed rank filtering for spatio-temporal data. J. Comp. Graph. Stat., 2010.\n\n[10] N. Cressie and C. Wikle. Statistics for spatio-temporal data. John Wiley & Sons, 2011.\n\n[11] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. 
Computers & Mathematics with Applications, 2(1):17\u201340, 1976.\n\n[12] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 2011.\n\n[13] M. Gu, A. Ruhe, G. Sleijpen, H. van der Vorst, Z. Bai, and R. Li. 5. Generalized Hermitian Eigenvalue Problems. Society for Industrial and Applied Mathematics, 2000.\n\n[14] E. Isaaks and R. Srivastava. Applied geostatistics. London: Oxford University, 2011.\n\n[15] A. Jalali, S. Sanghavi, C. Ruan, and P. Ravikumar. A dirty model for multi-task learning. In NIPS, 2010.\n\n[16] T. Kolda and B. Bader. Tensor decompositions and applications. SIAM review, 2009.\n\n[17] W.-J. Li and D.-Y. Yeung. Relation regularized matrix factorization. In IJCAI, 2009.\n\n[18] X. Long, L. Jin, and J. Joshi. Exploring trajectory-driven local geographic topics in foursquare. In UbiComp, 2012.\n\n[19] A. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatial-temporal causal modeling for climate change attribution. In KDD, 2009.\n\n[20] F. Nie, H. Huang, X. Cai, and C. H. Ding. Efficient and robust feature selection via joint $\\ell_{2,1}$-norms minimization. In NIPS, 2010.\n\n[21] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. In ICML, 2013.\n\n[22] S. Shalev-Shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank constraint. In ICML, 2011.\n\n[23] S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints. SIAM Journal on Optimization, 2010.\n\n[24] R. Tomioka, K. Hayashi, and H. Kashima. Convex Tensor Decomposition via Structured Schatten Norm Regularization. In NIPS, 2013.\n\n[25] T. Zhang. Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representations. IEEE Trans Inf Theory, pages 4689\u20134708, 2011.\n\n[26] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. 
Sch\u00f6lkopf. Learning with local and global consistency. In NIPS, 2003.\n\n[27] H. Zhou, L. Li, and H. Zhu. Tensor regression with applications in neuroimaging data analysis. JASA, 2013.\n\n[28] J. Zhou, J. Chen, and J. Ye. MALSAR: Multi-tAsk Learning via StructurAl Regularization. http://www.public.asu.edu/~jye02/Software/MALSAR/, 2011.\n", "award": [], "sourceid": 1829, "authors": [{"given_name": "Mohammad Taha", "family_name": "Bahadori", "institution": "U of Southern California"}, {"given_name": "Qi (Rose)", "family_name": "Yu", "institution": "University of Southern California"}, {"given_name": "Yan", "family_name": "Liu", "institution": "University of Southern California"}]}