{"title": "Variational Gaussian-process factor analysis for modeling spatio-temporal data", "book": "Advances in Neural Information Processing Systems", "page_first": 1177, "page_last": 1185, "abstract": "We present a probabilistic latent factor model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.", "full_text": "Variational Gaussian-process factor analysis for\n\nmodeling spatio-temporal data\n\nJaakko Luttinen\n\nAdaptive Informatics Research Center\n\nHelsinki University of Technology, Finland\n\nAlexander Ilin\n\nAdaptive Informatics Research Center\n\nHelsinki University of Technology, Finland\n\nJaakko.Luttinen@tkk.fi\n\nAlexander.Ilin@tkk.fi\n\nAbstract\n\nWe present a probabilistic factor analysis model which can be used for studying\nspatio-temporal datasets. The spatial and temporal structure is modeled by using\nGaussian process priors both for the loading matrix and the factors. The posterior\ndistributions are approximated using the variational Bayesian framework. High\ncomputational cost of Gaussian process modeling is reduced by using sparse ap-\nproximations. The model is used to compute the reconstructions of the global\nsea surface temperatures from a historical dataset. 
The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.\n\n1 Introduction\n\nFactor analysis and principal component analysis (PCA) are widely used linear techniques for finding dominant patterns in multivariate datasets. These methods find the most prominent correlations in the data and therefore they facilitate studies of the observed system. The found principal patterns can also give an insight into the observed data variability. In many applications, the quality of this kind of modeling can be significantly improved if extra knowledge about the data structure is used. For example, taking into account the temporal information typically leads to more accurate modeling of time series.\n\nIn this work, we present a factor analysis model which makes use of both temporal and spatial information for a set of collected data. The method is based on the standard factor analysis model\n\nY = WX + noise = \u2211_{d=1}^{D} w:d xd:^T + noise ,    (1)\n\nwhere Y is a matrix of spatio-temporal data in which each row contains measurements in one spatial location and each column corresponds to one time instance. Here and in the following, we denote by ai: and a:i the i-th row and column of a matrix A, respectively (both are column vectors). Thus, each xd: represents the time series of one of the D factors whereas w:d is a vector of loadings which are spatially distributed. The matrix Y can contain missing values and the samples can be unevenly distributed in space and time.1\n\nWe assume that both the factors xd: and the corresponding loadings w:d have prominent structures. We describe them by using Gaussian processes (GPs), which are a flexible and theoretically solid tool for smoothing and interpolating non-uniform data [8]. Using separate GP models for xd: and w:d facilitates analysis of large spatio-temporal datasets. 
The application of the GP methodology to modeling the data Y directly could be infeasible in real-world problems because the computational complexity of inference scales cubically w.r.t. the number of data points. The advantage of the proposed approach is that we perform GP modeling only either in the spatial or in the temporal domain at a time. Thus, the dimensionality can be reduced remarkably and modeling large datasets becomes feasible. Also, the good interpretability of the model makes it easy to explore the results in the spatial and temporal domains and to set priors reflecting our modeling assumptions. The proposed model is symmetrical w.r.t. space and time.\n\n1 In practical applications, it may be desirable to diminish the effect of uneven sampling over space or time by, for example, using proper weights for different data points.\n\n1\n\n\fOur model bears similarities to the latent variable models presented in [13, 16]. There, GPs were used to describe the factors and the mixing matrix was point-estimated. Therefore the observations Y were modeled with a GP. In contrast to that, our model is not a GP model for the observations because the marginal distribution of Y is not Gaussian. This makes the posterior distribution of the unknown parameters intractable. Therefore we use an approximation based on the variational Bayesian methodology. We also show how to use sparse variational approximations to reduce the computational load. Models which use GP priors for both W and X in (1) have recently been proposed in [10, 11]. The function factorization model in [10] is learned using a Markov chain Monte Carlo sampling procedure, which may be computationally infeasible for large-scale datasets. The nonnegative matrix factorization model in [11] uses point estimates for the unknown parameters, thus ignoring posterior uncertainties. 
In our method, we take into account posterior uncertainties, which helps reduce overfitting and facilitates learning a more accurate model.\n\nIn the experimental part, we use the model to compute reconstructions of missing values in a real-world spatio-temporal dataset. We use a historical sea surface temperature dataset which contains monthly anomalies in the 1856\u20131991 period to reconstruct the global sea surface temperatures. The same dataset was used in designing the state-of-the-art reconstruction methodology [5]. We show the advantages of the proposed method as a Bayesian technique which can incorporate all assumptions in one model and which uses all available data. Since reconstruction of missing values can be an important application for the method, we give all the formulas assuming missing values in the data matrix Y.\n\n2 Factor analysis model with Gaussian process priors\n\nWe use the factor analysis model (1) in which Y has dimensionality M \u00d7 N and the number of factors D is much smaller than the number of spatial locations M and the number of time instances N. The m-th row of Y corresponds to a spatial location lm (e.g., a location on a two-dimensional map) and the n-th column corresponds to a time instance tn.\n\nWe assume that each time signal xd: contains values of a latent function \u03c7d(t) computed at the time instances tn. We use independent Gaussian process priors to describe each signal xd::\n\np(X) = N( X: | 0, Kx ) = \u220f_{d=1}^{D} N( xd: | 0, Kd ) ,   [Kd]ij = \u03c8d(ti, tj; \u03b8d) ,    (2)\n\nwhere X: denotes a long vector formed by concatenating the columns of X, Kd is the part of the large covariance matrix Kx which corresponds to the d-th row of X, and N( a | b, C ) denotes the Gaussian probability density function for variable a with mean b and covariance C. 
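To make the prior (2) concrete, here is a small numpy sketch (our illustration, not code from the paper; the time grid, lengthscale and jitter value are arbitrary choices) that builds the covariance matrix Kd of one factor from a squared-exponential kernel and draws a sample signal from it:

```python
import numpy as np

# Build K_d of Eq. (2) by evaluating a kernel on the time grid, then
# sample one latent time signal x_d: ~ N(0, K_d).

def squared_exponential(t, lengthscale):
    # k(t_i, t_j) = exp(-(t_i - t_j)^2 / (2 * lengthscale^2))
    r = t[:, None] - t[None, :]
    return np.exp(-r**2 / (2.0 * lengthscale**2))

t = np.linspace(0.0, 10.0, 200)              # time instances t_n
K_d = squared_exponential(t, lengthscale=2.0)

# A small jitter term keeps the Cholesky factorization numerically stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K_d + 1e-6 * np.eye(len(t)))
x_d = L @ rng.standard_normal(len(t))
```

The same construction applies to the spatial priors (3), with the kernel evaluated at the spatial locations lm instead of the time instances.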
The ij-th element of Kd is computed using the covariance function \u03c8d with the kernel hyperparameters \u03b8d.\n\nThe priors for W are defined similarly, assuming that each spatial pattern w:d contains measurements of a function \u03c9(l) at different spatial locations lm:\n\np(W) = \u220f_{d=1}^{D} N( w:d | 0, Kw_d ) ,   [Kw_d]ij = \u03d5d(li, lj; \u03c6d) ,    (3)\n\nwhere \u03d5d is a covariance function with hyperparameters \u03c6d. Any valid (positive semidefinite) kernels can be used to define the covariance functions \u03c8d and \u03d5d. A good list of possible covariance functions is given in [8]. The prior model reduces to the one used in probabilistic PCA [14] when Kd = I and a uniform prior is used for W.\n\nThe noise term in (1) is modeled with a Gaussian distribution, resulting in a likelihood function\n\np(Y | W, X, \u03c3) = \u220f_{mn\u2208O} N( ymn | wm:^T x:n, \u03c3\u00b2mn ) ,    (4)\n\nwhere the product is evaluated over the observed elements in Y whose indices are included in the set O. We will refer to the model (1)\u2013(4) as GPFA. In practice, the noise level can be assumed spatially (\u03c3mn = \u03c3m) or temporally (\u03c3mn = \u03c3n) varying. One can also use a spatially and temporally varying noise level \u03c3\u00b2mn if this variability can be estimated somehow.\n\nThere are two main difficulties which should be addressed when learning the model: 1) the posterior p(W, X | Y) is intractable and 2) the computational load for dealing with GPs can be too large for real-world datasets. We use the variational Bayesian framework to cope with the first difficulty and we also adopt the variational approach when computing sparse approximations for the GP posterior.\n\n2\n\n\f3 Learning algorithm\n\nIn the variational Bayesian framework, the true posterior is approximated using some restricted class of possible distributions. 
An approximate distribution which factorizes as\n\np(W, X | Y) \u2248 q(W, X) = q(W)q(X)\n\nis typically used for factor analysis models. The approximation q(W, X) can be found by minimizing the Kullback-Leibler divergence from the true posterior. This optimization is equivalent to the maximization of the lower bound of the marginal log-likelihood:\n\nlog p(Y) \u2265 \u222b q(W)q(X) log [ p(Y | W, X) p(W) p(X) / ( q(W)q(X) ) ] dW dX .    (5)\n\nFree-form maximization of (5) w.r.t. q(X) yields that\n\nq(X) \u221d p(X) exp\u27e8log p(Y | W, X)\u27e9q(W) ,\n\nwhere \u27e8\u00b7\u27e9 refers to the expectation over the approximate posterior distribution q. Omitting the derivations here, this boils down to the following update rule:\n\nq(X) = N( X: | (Kx^-1 + U)^-1 Z: , (Kx^-1 + U)^-1 ) ,    (6)\n\nwhere Z: is a DN \u00d7 1 vector formed by concatenation of the vectors\n\nz:n = \u2211_{m\u2208On} \u03c3mn^-2 \u27e8wm:\u27e9 ymn .    (7)\n\nThe summation in (7) is over the set On of indices m for which ymn is observed. Matrix U in (6) is a DN \u00d7 DN block-diagonal matrix with the following D \u00d7 D matrices on the diagonal:\n\nUn = \u2211_{m\u2208On} \u03c3mn^-2 \u27e8wm: wm:^T\u27e9 ,   n = 1, . . . , N .    (8)\n\nNote that the form of the approximate posterior (6) is similar to the regular GP regression: one can interpret Un^-1 z:n as noisy observations with the corresponding noise covariance matrices Un^-1. Then, q(X) in (6) is simply the posterior distribution of the latent function values \u03c7d(tn).\n\nThe optimal q(W) can be computed using formulas symmetrical to (6)\u2013(8) in which X and W are appropriately exchanged. The variational EM algorithm for learning the model consists of alternate updates of q(W) and q(X) until convergence. The noise level can be estimated by using a point estimate or by adding a factor q(\u03c3mn) to the approximate posterior distribution. 
For example, the update rules for the case of isotropic noise \u03c3\u00b2mn = \u03c3\u00b2 are given in [2].\n\n3.1 Component-wise factorization\n\nIn practice, one may need to factorize the posterior approximation further in order to reduce the computational burden. This can be done in two ways: by neglecting the posterior correlations between different factors xd: (and between spatial patterns w:d, respectively) or by neglecting the posterior correlations between different time instances x:n (and between spatial locations wm:, respectively). We suggest using the first way, which is computationally more expensive but allows one to capture stronger posterior correlations. This yields a posterior approximation q(X) = \u220f_{d=1}^{D} q(xd:) which can be updated as follows:\n\nq(xd:) = N( xd: | (Kd^-1 + Vd)^-1 cd , (Kd^-1 + Vd)^-1 ) ,   d = 1, . . . , D ,    (9)\n\nwhere cd is an N \u00d7 1 vector whose n-th component is\n\n[cd]n = \u2211_{m\u2208On} \u03c3mn^-2 \u27e8wmd\u27e9 ( ymn - \u2211_{j\u2260d} \u27e8wmj\u27e9\u27e8xjn\u27e9 )    (10)\n\nand Vd is an N \u00d7 N diagonal matrix whose n-th diagonal element is [Vd]nn = \u2211_{m\u2208On} \u03c3mn^-2 \u27e8wmd\u00b2\u27e9.\n\nThe main difference to (6) is that each component is fitted to the residuals of the reconstruction based on the rest of the components. The computational complexity is now reduced compared to (6), as shown in Table 1.\n\n3\n\n\fMethod | Approximation | Update rule | Complexity\nGP on Y | \u2013 | \u2013 | O(N\u00b3M\u00b3)\nGPFA | q(X:) | (6) | O(D\u00b3N\u00b3 + D\u00b3M\u00b3)\nGPFA | q(xd:) | (9) | O(DN\u00b3 + DM\u00b3)\nGPFA | q(xd:), inducing inputs | (12) | O( \u2211_{d=1}^{D} Nd\u00b2 N + \u2211_{d=1}^{D} Md\u00b2 M )\n\nTable 1: The computational complexity of different algorithms\n\nThe component-wise factorization may provide a meaningful representation of data because the model is biased in favor of solutions with dynamically and spatially decoupled components. 
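The component-wise update (9)-(10) can be sketched in a few lines of numpy. This is our illustration, not the authors' code: W_mean and W_var stand for the first two moments of q(W), mask encodes the observed set O, and isotropic noise sigma2 is assumed:

```python
import numpy as np

# One component-wise update of q(x_d:) as in Eqs. (9)-(10), isotropic noise.
# Y: (M, N) data; mask: (M, N) booleans, True where y_mn is observed;
# W_mean, W_var: (M, D) posterior means and variances of the loadings;
# X_mean: (D, N) current posterior means of the factors; K_d: (N, N) prior.

def update_factor(d, Y, mask, W_mean, W_var, X_mean, K_d, sigma2):
    # Residuals of the reconstruction by all components except d (Eq. 10).
    recon_rest = W_mean @ X_mean - np.outer(W_mean[:, d], X_mean[d])
    R = np.where(mask, Y - recon_rest, 0.0)
    c_d = (W_mean[:, d] / sigma2) @ R           # [c_d]_n of Eq. (10)
    w2 = W_mean[:, d]**2 + W_var[:, d]          # second moment <w_md^2>
    v_d = (w2 / sigma2) @ mask                  # diagonal of V_d
    # Posterior covariance and mean of q(x_d:), Eq. (9).
    Sigma = np.linalg.inv(np.linalg.inv(K_d) + np.diag(v_d))
    return Sigma @ c_d, Sigma
```

Cycling this update over d = 1, ..., D, together with the symmetric update for the spatial patterns, corresponds to the alternate variational updates described above.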
When the factors are modeled using rather general covariance functions, the proposed method is somewhat related to the blind source separation techniques using time structure (e.g., [1]). The advantage here is that the method can handle more sophisticated temporal correlations and it is easily applicable to incomplete data. In addition, one can use the method in semi-blind settings where prior knowledge is used to extract components with specific types of temporal or spatial features [9]. This problem can be addressed using the proposed technique with properly chosen covariance functions.\n\n3.2 Variational learning of sparse GP approximations\n\nOne of the main issues with Gaussian processes is the high computational cost with respect to the number of observations. Although the variational learning of the GPFA model works only in either the spatial or the temporal domain at a time, the size of the data may still be too large in practice. A common way to reduce the computational cost is to use sparse approximations [7]. In this work, we follow the variational formulation of sparse approximations presented in [15].\n\nThe main idea is to introduce a set of auxiliary variables {w, x} which contain the values of the latent functions \u03c9d(l), \u03c7d(t) at some locations {l = \u03bbd_m | m = 1, . . . , Md}, {t = \u03c4d_n | n = 1, . . . , Nd} called inducing inputs. Assuming that the auxiliary variables {w, x} summarize the data well, it holds that p(W, X | w, x, Y) \u2248 p(W, X | w, x), which suggests a convenient form of the approximate posterior:\n\nq(W, X, w, x) = p(W | w) p(X | x) q(w) q(x) ,    (11)\n\nwhere p(W | w), p(X | x) can be easily computed from the GP priors. Optimal q(w), q(x) can be computed by maximizing the variational lower bound of the marginal log-likelihood similar to (5).\n\nFree-form maximization w.r.t. 
q(x) yields the following update rule:\n\nq(x) = N( x | \u03a3 Kx^-1 Kxx Z: , \u03a3 ) ,   \u03a3 = ( Kx^-1 + Kx^-1 Kxx U Kxx^T Kx^-1 )^-1 ,    (12)\n\nwhere x is the vector of concatenated auxiliary variables for all factors, Kx is the GP prior covariance matrix of x and Kxx is the covariance between x and X:. This equation can be seen as a replacement of (6). A similar formula is applicable to the update of q(w). The advantage here is that the number of inducing inputs is smaller than the number of data samples, that is, Md < M and Nd < N, and therefore the required computational load can be reduced (see more details in [15]). Eq. (12) can be quite easily adapted to the component-wise factorization of the posterior in order to reduce the computational load of (9). See the summary of the computational complexity in Table 1.\n\n4\n\n\f3.3 Update of GP hyperparameters\n\nThe hyperparameters of the GP priors can be updated quite similarly to the standard GP regression by maximizing the lower bound of the marginal log-likelihood. Omitting the derivations here, this lower bound for the temporal covariance functions {\u03c8d(t)}_{d=1}^{D} equals (up to a constant)\n\nlog N( U^-1 Z: | 0, U^-1 + Kxx^T Kx^-1 Kxx ) - (1/2) tr[ \u2211_{n=1}^{N} Un D ] ,    (13)\n\nwhere U and Z: have the same meaning as in (6) and D is a D \u00d7 D (diagonal) matrix of the variances of x:n given the auxiliary variables x. The required gradients are shown in the appendix. The equations without the use of auxiliary variables are similar except that Kxx^T Kx^-1 Kxx is replaced by the full prior covariance matrix of X: from (2), and the second term disappears. A symmetrical equation can be derived for the hyperparameters of the spatial functions \u03d5d(l). The extension of (13) to the case of component-wise factorial approximation is straightforward. 
The inducing inputs can also be treated as variational parameters and they can be changed to optimize the lower bound (13).\n\n4 Experiments\n\n4.1 Artificial example\n\nWe generated a dataset with M = 30 sensors (two-dimensional spatial locations) and N = 200 time instances using the generative model (1) with a moderate amount of observation noise, assuming \u03c3mn = \u03c3. D = 4 temporal signals xd: were generated by taking samples from GP priors with different covariance functions: 1) a squared exponential function to model a slowly changing component:\n\nk(r; \u03b81) = exp( -r\u00b2 / (2\u03b81\u00b2) ) ,    (14)\n\n2) a periodic function with decay to model a quasi-periodic component:\n\nk(r; \u03b81, \u03b82, \u03b83) = exp( -2 sin\u00b2(\u03c0r/\u03b81) / \u03b82\u00b2 - r\u00b2 / (2\u03b83\u00b2) ) ,    (15)\n\nwhere r = |tj - ti|, and 3) a compactly supported piecewise polynomial function to model two fast changing components with different timescales:\n\nk(r; \u03b81) = (1/3) (1 - r)^(b+2) ( (b\u00b2 + 4b + 3) r\u00b2 + (3b + 6) r + 3 ) ,    (16)\n\nwhere r = min(1, |tj - ti| / \u03b81) and b = 3 for one-dimensional inputs, with the hyperparameter \u03b81 defining a threshold such that k(r) = 0 for |tj - ti| \u2265 \u03b81. The loadings were generated from GPs over the two-dimensional space using the squared exponential covariance function (14) with an additional scale parameter \u03b82:\n\nk(r; \u03b81, \u03b82) = \u03b82\u00b2 exp( -r\u00b2 / (2\u03b81\u00b2) ) .    (17)\n\nWe randomly selected 452 data points from Y as being observed; thus most of the generated data points were marked as missing (see Fig. 1a for examples). We also removed observations from all the sensors for a relatively long time interval. Note the resulting gap in the data marked with vertical lines in Fig. 1a. 
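For reference, the covariance functions (14)-(17) above can be transcribed directly into code (our transcription, not the authors' implementation; b = 3 is fixed as stated for one-dimensional inputs, and r denotes the absolute input distance):

```python
import numpy as np

def sq_exp(r, theta1):                           # Eq. (14)
    return np.exp(-r**2 / (2.0 * theta1**2))

def quasi_periodic(r, theta1, theta2, theta3):   # Eq. (15)
    return np.exp(-2.0 * np.sin(np.pi * r / theta1)**2 / theta2**2
                  - r**2 / (2.0 * theta3**2))

def piecewise_poly(r, theta1, b=3):              # Eq. (16); zero for r >= theta1
    s = np.minimum(1.0, np.abs(r) / theta1)
    return (1.0 - s)**(b + 2) * ((b**2 + 4*b + 3)*s**2 + (3*b + 6)*s + 3) / 3.0

def scaled_sq_exp(r, theta1, theta2):            # Eq. (17)
    return theta2**2 * sq_exp(r, theta1)
```

All four functions equal their maximum at r = 0, and the piecewise polynomial kernel additionally yields sparse covariance matrices because it vanishes beyond the threshold theta1.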
The hyperparameters of the Gaussian processes were initialized randomly close to the values used for data generation, assuming that a good guess about the hidden signals can be obtained by exploratory analysis of the data.\n\nFig. 1b shows the components recovered by GPFA using the update rule (6). Note that the algorithm separated the four signals with the different variability timescales. The posterior predictive distributions of the missing values presented in Fig. 1a show that the method was able to capture temporal correlations on different timescales. Note also that although some of the sensors contain very few observations, the missing values are reconstructed quite well. This is a positive effect of the spatially smooth priors.\n\n5\n\n\f[Figure 1 appears here: panels (a) show the signals y1(t), y5(t), y19(t), y20(t) and panels (b) the latent signals x1(t)\u2013x4(t), all plotted against time t.]\n\nFigure 1: Results for the artificial experiment. (a) Posterior predictive distribution for four randomly selected locations with the observations shown as crosses, the gap with no training observations marked with vertical lines and some test values shown as circles. (b) The posteriors of the four latent signals xd:. In both figures, the solid lines show the posterior mean and gray color shows two standard deviations.\n\n4.2 Reconstruction of global SST using the MOHSST5 dataset\n\nWe demonstrate how the presented model can be used to reconstruct global sea surface temperatures (SST) from historical measurements. We use the U.K. Meteorological Office historical SST data set (MOHSST5) [6] that contains monthly SST anomalies in the 1856\u20131991 period for 5\u25e6 \u00d7 5\u25e6 longitude-latitude bins. The dataset contains in total approximately 1600 time instances and 1700 spatial locations. 
The dataset is sparse, especially during the 19th century and the World Wars, having 55% of the values missing, and thus it contains more than 10^6 actual observations in total.\n\nWe used the proposed algorithm to estimate D = 80 components, the same number as was used in [5]. We withdrew 20% of the data from the training set and used this part for testing the reconstruction accuracy. We used five time signals xd: with the squared exponential function (14) to describe climate trends. Another five temporal components were modeled with the quasi-periodic covariance function (15) to capture periodic signals (e.g., related to the annual cycle). We also used five components with the squared exponential function to model prominent interannual phenomena such as El Ni\u00f1o. Finally, we used the piecewise polynomial functions to describe the remaining 65 time signals xd:. These dimensionalities were chosen ad hoc. The covariance function for each spatial pattern w:d was the scaled squared exponential (17). The distance r between the locations li and lj was measured on the surface of the Earth using the spherical law of cosines. The use of the extra parameter \u03b82 in (17) allowed automatic pruning of unnecessary factors, which happens when \u03b82 = 0.\n\nWe used the component-wise factorial approximation of the posterior described in Section 3.1. We also introduced 500 inducing inputs for each spatial function \u03c9d(l) in order to use sparse variational approximations. Similar sparse approximations were used for the 15 temporal functions \u03c7d(t) which modeled slow climate variability: the slowest, quasi-periodic and interannual components had 80, 300 and 300 inducing inputs, respectively. The inducing inputs were initialized by taking a random subset from the original inputs and then kept fixed throughout learning because their optimization would have increased the computational burden substantially. 
For the rest of the temporal phenomena, we used the piecewise polynomial functions (16) that produce priors with a sparse covariance matrix and therefore allow efficient computations.\n\nThe dataset was preprocessed by weighting the data points by the square root of the corresponding latitudes in order to diminish the effect of denser sampling in the polar regions; the same noise level was then assumed for all measurements (\u03c3mn = \u03c3). Preprocessing by weighting data points ymn with weights sm is essentially equivalent to assuming a spatially varying noise level \u03c3mn = \u03c3/sm. The GP hyperparameters were initialized taking into account the assumed smoothness of the spatial patterns and the variability timescale of the temporal factors. The factors X were initialized randomly by sampling from the prior and the weights W were initialized to zero. The variational EM algorithm of GPFA was run for 200 iterations. We also applied the variational Bayesian PCA (VBPCA) [2] to the same dataset for comparison.\n\n6\n\n\f[Figure 2 appears here: spatial maps (color scales roughly from -1 to 1) and the corresponding time series over 1875\u20131975 for the four most dominating components of each model.]\n\nFigure 2: Experimental results for the MOHSST5 dataset. The spatial and temporal patterns of the four most dominating principal components for GPFA (above) and VBPCA (below). The solid lines and gray color in the time series show the mean and two standard deviations of the posterior distribution. The uncertainties of the spatial patterns are not shown, and we saturated the visualizations of the VBPCA spatial components to reduce the effect of the uncertain pole regions.\n\n
VBPCA was initialized randomly as the initial-\nization did not have much effect on the VBPCA results. Finally, we rotated the GPFA components\nsuch that the orthogonal basis in the factor analysis subspace was ordered according to the amount of\nexplained data variance (where the variance was computed by averaging over time). Thus, \u201cGPFA\nprincipal components\u201d are mixtures of the original factors found by the algorithm. This was done\nfor comparison with the most prominent patterns found with VBPCA.\n\nFig. 2 shows the spatial and temporal patterns of the four most dominant principal components for\nboth models. The GPFA principal components and the corresponding spatial patterns are generally\nsmoother, especially in the data-sparse regions, for example, in the period before 1875. The \ufb01rst and\nthe second principal components of GPFA as well as the \ufb01rst and the third components of VBPCA\nare related to El Ni\u02dcno. We should make a note here that the rotation within the principal subspace\nmay be affected by noise and therefore the components may not be directly comparable. Another\nobservation was that the model ef\ufb01ciently used only some of the 15 slow components: about three\nvery slow and two interannual components had relatively large weights in the loading matrix W.\nTherefore the selected number of slow components did not affect the results signi\ufb01cantly. None\n\n7\n\n\fof the periodic components had large weights, which suggests that the fourth VBPCA component\nmight contain artifacts.\n\nFinally, we compared the two models by computing a weighted root mean square reconstruction\nerror on the test set, similarly to [4]. The prediction errors were 0.5714 for GPFA and 0.6180\nfor VBPCA. 
The improvement obtained by GPFA can be considered quite signi\ufb01cant taking into\naccount the substantial amount of noise in the data.\n\n5 Conclusions and discussion\n\nIn this work, we proposed a factor analysis model which can be used for modeling spatio-temporal\ndatasets. The model is based on using GP priors for both spatial patterns and time signals corre-\nsponding to the hidden factors. The method can be seen as a combination of temporal smoothing,\nempirical orthogonal functions (EOF) analysis and kriging. The latter two methods are popular in\ngeostatistics (see, e.g., [3]). We presented a learning algorithm that can be applicable to relatively\nlarge datasets.\n\nThe proposed model was applied to the problem of reconstruction of historical global sea surface\ntemperatures. The current state-of-the-art reconstruction methods [5] are based on the reduced space\n(i.e. EOF) analysis with smoothness assumptions for the spatial and temporal patterns. That ap-\nproach is close to probabilistic PCA [14] with \ufb01tting a simple auto-regressive model to the posterior\nmeans of the hidden factors. Our GPFA model is based on probabilistic formulation of essentially\nthe same modeling assumptions. The gained advantage is that GPFA takes into account the uncer-\ntainty about the unknown parameters, it can use all available data and it can combine all modeling\nassumptions in one estimation procedure. The reconstruction results obtained with GPFA are very\npromising and they suggest that the proposed model might be able to improve the existing SST\nreconstructions. The improvement is possible because the method is able to model temporal and\nspatial phenomena on different scales by using properly selected GPs.\n\nA The gradients for the updates of GP hyperparameters\n\nThe gradient of the \ufb01rst term of (13) w.r.t. 
a hyperparameter (or inducing input) \u03b8 of any covariance function is given by\n\n(1/2) tr[ (Kx^-1 - A^-1) \u2202Kx/\u2202\u03b8 ] - tr[ U Kxx^T A^-1 \u2202Kxx/\u2202\u03b8 ] - (1/2) b^T (\u2202Kx/\u2202\u03b8) b + b^T (\u2202Kxx/\u2202\u03b8) ( Z: - U Kxx^T b ) ,\n\nwhere A = Kx + Kxx U Kxx^T and b = A^-1 Kxx Z:. This part is similar to the gradient reported in [12]. Without the sparse approximation, the auxiliary variables coincide with X:, so that Kx and Kxx both equal the full prior covariance of X:, and the equation simplifies to the regular gradient in GP regression for the projected observations U^-1 Z: with the noise covariance U^-1. The second part of (13) results in the extra terms\n\n-(1/2) ( tr[ (\u2202KX/\u2202\u03b8) U ] + tr[ (\u2202Kx/\u2202\u03b8) Kx^-1 Kxx U Kxx^T Kx^-1 ] - 2 tr[ (\u2202Kxx/\u2202\u03b8) U Kxx^T Kx^-1 ] ) ,    (18)\n\nwhere KX denotes the full prior covariance matrix of X: defined in (2). The terms in (18) cancel out when the sparse approximation is not used. Both parts of the gradient can be efficiently evaluated using the Cholesky decomposition. The positivity constraints of the hyperparameters can be taken into account by optimizing with respect to the logarithms of the hyperparameters.\n\nAcknowledgments\n\nThis work was supported in part by the Academy of Finland under the Centers for Excellence in Research program and Alexander Ilin\u2019s postdoctoral research project. We would like to thank Alexey Kaplan for fruitful discussions and providing his expertise on the problem of sea surface temperature reconstruction.\n\nReferences\n\n[1] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, and E. Moulines. A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2):434\u2013444, 1997.\n\n8\n\n\f[2] C. M. Bishop. Variational principal components. 
In Proceedings of the 9th International Conference on\n\nArti\ufb01cial Neural Networks (ICANN\u201999), pages 509\u2013514, 1999.\n\n[3] N. Cressie. Statistics for Spatial Data. Wiley-Interscience, New York, 1993.\n[4] A. Ilin and A. Kaplan. Bayesian PCA for reconstruction of historical sea surface temperatures. In Pro-\nceedings of the International Joint Conference on Neural Networks (IJCNN 2009), pages 1322\u20131327,\nAtlanta, USA, June 2009.\n\n[5] A. Kaplan, M. Cane, Y. Kushnir, A. Clement, M. Blumenthal, and B. Rajagopalan. Analysis of global sea\n\nsurface temperatures 1856\u20131991. Journal of Geophysical Research, 103:18567\u201318589, 1998.\n\n[6] D. E. Parker, P. D. Jones, C. K. Folland, and A. Bevan. Interdecadal changes of surface temperature since\n\nthe late nineteenth century. Journal of Geophysical Research, 99:14373\u201314399, 1994.\n\n[7] J. Qui\u02dcnonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process\n\nregression. Journal of Machine Learning Research, 6:1939\u20131959, Dec. 2005.\n\n[8] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[9] J. S\u00a8arel\u00a8a and H. Valpola. Denoising source separation. Journal of Machine Learning Research, 6:233\u2013\n\n272, 2005.\n\n[10] M. N. Schmidt. Function factorization using warped Gaussian processes. In L. Bottou and M. Littman,\neditors, Proceedings of the 26th International Conference on Machine Learning (ICML\u201909), pages 921\u2013\n928, Montreal, June 2009. Omnipress.\n\n[11] M. N. Schmidt and H. Laurberg. Nonnegative matrix factorization with Gaussian process priors. Compu-\n\ntational Intelligence and Neuroscience, 2008:1\u201310, 2008.\n\n[12] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian pro-\ncess regression. 
In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (AISTATS\u201903), pages 205\u2013213, 2003.\n\n[13] Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS\u201905), pages 333\u2013340, 2005.\n\n[14] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B, 61(3):611\u2013622, 1999.\n\n[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Workshop on Artificial Intelligence and Statistics (AISTATS\u201909), pages 567\u2013574, 2009.\n\n[16] B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. In Advances in Neural Information Processing Systems 21, pages 1881\u20131888, 2009.\n\n9\n\n\f", "award": [], "sourceid": 705, "authors": [{"given_name": "Jaakko", "family_name": "Luttinen", "institution": null}, {"given_name": "Alexander", "family_name": "Ilin", "institution": null}]}