{"title": "A scalable end-to-end Gaussian process adapter for irregularly sampled time series classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1804, "page_last": 1812, "abstract": "We present a general framework for classification of sparse and irregularly-sampled time series. The properties of such time series can result in substantial uncertainty about the values of the underlying temporal processes, while making the data difficult to deal with using standard classification methods that assume fixed-dimensional feature spaces. To address these challenges, we propose an uncertainty-aware classification framework based on a special computational layer we refer to as the Gaussian process adapter that can connect irregularly sampled time series data to any black-box classifier learnable using gradient descent. We show how to scale up the required computations based on combining the structured kernel interpolation framework and the Lanczos approximation method, and how to discriminatively train the Gaussian process adapter in combination with a number of classifiers end-to-end using backpropagation.", "full_text": "A scalable end-to-end Gaussian process adapter for\n\nirregularly sampled time series classi\ufb01cation\n\nSteven Cheng-Xian Li\nBenjamin Marlin\nCollege of Information and Computer Sciences\n\nUniversity of Massachusetts Amherst\n\nAmherst, MA 01003\n\n{cxl,marlin}@cs.umass.edu\n\nAbstract\n\nWe present a general framework for classi\ufb01cation of sparse and irregularly-sampled\ntime series. The properties of such time series can result in substantial uncertainty\nabout the values of the underlying temporal processes, while making the data\ndif\ufb01cult to deal with using standard classi\ufb01cation methods that assume \ufb01xed-\ndimensional feature spaces. 
To address these challenges, we propose an uncertainty-\naware classi\ufb01cation framework based on a special computational layer we refer to\nas the Gaussian process adapter that can connect irregularly sampled time series\ndata to any black-box classi\ufb01er learnable using gradient descent. We show how\nto scale up the required computations based on combining the structured kernel\ninterpolation framework and the Lanczos approximation method, and how to\ndiscriminatively train the Gaussian process adapter in combination with a number\nof classi\ufb01ers end-to-end using backpropagation.\n\n1\n\nIntroduction\n\nIn this paper, we propose a general framework for classi\ufb01cation of sparse and irregularly-sampled\ntime series. An irregularly-sampled time series is a sequence of samples with irregular intervals\nbetween their observation times. These intervals can be large when the time series are also sparsely\nsampled. Such time series data are studied in various areas including climate science [22], ecology\n[4], biology [18], medicine [15] and astronomy [21]. Classi\ufb01cation in this setting is challenging both\nbecause the data cases are not naturally de\ufb01ned in a \ufb01xed-dimensional feature space due to irregular\nsampling and variable numbers of samples, and because there can be substantial uncertainty about\nthe underlying temporal processes due to the sparsity of observations.\nRecently, Li and Marlin [13] introduced the mixture of expected Gaussian kernels (MEG) framework,\nan uncertainty-aware kernel for classifying sparse and irregularly sampled time series. Classi\ufb01cation\nwith MEG kernels is shown to outperform models that ignore uncertainty due to sparse and irregular\nsampling. 
On the other hand, various deep learning models including convolutional neural networks [12] have been successfully applied to \ufb01elds such as computer vision and natural language processing, and have been shown to achieve state-of-the-art results on various tasks. Some of these models have desirable properties for time series classi\ufb01cation, but cannot be directly applied to sparse and irregularly sampled time series.\nInspired by the MEG kernel, we propose an uncertainty-aware classi\ufb01cation framework that enables learning black-box classi\ufb01cation models from sparse and irregularly sampled time series data. This framework is based on the use of a computational layer that we refer to as the Gaussian process (GP) adapter. The GP adapter uses Gaussian process regression to transform the irregular time series data into a uniform representation, allowing sparse and irregularly sampled data to be fed into any black-box classi\ufb01er learnable using gradient descent while preserving uncertainty. However, the O(n3) time and O(n2) space of exact GP regression makes the GP adapter prohibitively expensive when scaling up to large time series.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTo address this problem, we show how to speed up the key computation of sampling from a GP posterior based on combining the structured kernel interpolation (SKI) framework that was recently proposed by Wilson and Nickisch [25] with Lanczos methods for approximating matrix functions [3]. Using the proposed sampling algorithm, the GP adapter can run in linear time and space in terms of the length of the time series, and O(m log m) time when m inducing points are used.\nWe also show that the GP adapter can be trained end-to-end together with the parameters of the chosen classi\ufb01er by backpropagation through the iterative Lanczos method. 
We present results using logistic regression, fully-connected feedforward networks, convolutional neural networks and the MEG kernel. We show that end-to-end discriminative training of the GP adapter outperforms a variety of baselines in terms of classi\ufb01cation performance, including models based only on GP mean interpolation, or with GP regression trained separately using marginal likelihood.\n\n2 Gaussian processes for sparse and irregularly-sampled time series\n\nOur focus in this paper is on time series classi\ufb01cation in the presence of sparse and irregular sampling. In this problem, the data D contain N independent tuples consisting of a time series Si and a label yi. Thus, D = {(S1, y1), . . . , (SN , yN )}. Each time series Si is represented as a list of time points ti = [ti1, . . . , ti|Si|]\u22a4, and a list of corresponding values vi = [vi1, . . . , vi|Si|]\u22a4. We assume that each time series is observed over a common time interval [0, T ]. However, different time series are not necessarily observed at the same time points (i.e. ti \u2260 tj in general). This implies that the number of observations in different time series is not necessarily the same (i.e. |Si| \u2260 |Sj| in general). Furthermore, the time intervals between observations within a single time series are not assumed to be uniform.\nLearning in this setting is challenging because the data cases are not naturally de\ufb01ned in a \ufb01xed-dimensional feature space due to the irregular sampling. This means that commonly used classi\ufb01ers that take \ufb01xed-length feature vectors as input are not applicable. In addition, there can be substantial uncertainty about the underlying temporal processes due to the sparsity of observations.\nTo address these challenges, we build on ideas from the MEG kernel [13] by using GP regression [17] to provide an uncertainty-aware representation of sparse and irregularly sampled time series. 
We \ufb01x a set of reference time points x = [x1, . . . , xd]\u22a4 and represent a time series S = (t, v) in terms of its posterior marginal distribution at these time points. We use GP regression with a zero-mean GP prior and a covariance function k(\u00b7,\u00b7) parameterized by kernel hyperparameters \u03b7. Let \u03c32 be the independent noise variance of the GP regression model. The GP parameters are \u03b8 = (\u03b7, \u03c32). Under this model, the marginal posterior GP at x is Gaussian distributed with the mean and covariance given by\n\n\u00b5 = Kx,t(Kt,t + \u03c32I)\u22121v,   (1)\n\u03a3 = Kx,x \u2212 Kx,t(Kt,t + \u03c32I)\u22121Kt,x   (2)\n\nwhere Kx,t denotes the covariance matrix with [Kx,t]ij = k(xi, tj). We note that it takes O(n3 + nd) time to exactly compute the posterior mean \u00b5, and O(n3 + n2d + nd2) time to exactly compute the full posterior covariance matrix \u03a3, where n = |t| and d = |x|.\n\n3 The GP adapter and uncertainty-aware time series classi\ufb01cation\n\nIn this section we describe our framework for time series classi\ufb01cation in the presence of sparse and irregular sampling. Our framework enables any black-box classi\ufb01er learnable by gradient-based methods to be applied to the problem of classifying sparse and irregularly sampled time series.\n\n3.1 Classi\ufb01cation frameworks and the Gaussian process adapter\n\nIn Section 2 we described how we can represent a time series through the marginal posterior it induces under a Gaussian process regression model at any set of reference time points x. 
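As a concrete illustration, the exact posterior in (1) and (2) can be computed directly with NumPy. This is a minimal sketch, not the paper's implementation; the squared exponential kernel and the hyperparameter values used here are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, eta=(1.0, 10.0)):
    # Squared exponential covariance k(t, t') = amp * exp(-scale * (t - t')^2);
    # the hyperparameters eta = (amp, scale) are illustrative placeholders.
    amp, scale = eta
    return amp * np.exp(-scale * (a[:, None] - b[None, :]) ** 2)

def gp_posterior(t, v, x, sigma2=0.1, eta=(1.0, 10.0)):
    """Exact marginal posterior at reference points x, as in (1) and (2).

    Costs O(n^3 + nd) time for the mean and O(n^3 + n^2 d + n d^2) for the
    full covariance, with n = |t| and d = |x|.
    """
    K_tt = rbf_kernel(t, t, eta) + sigma2 * np.eye(len(t))
    K_xt = rbf_kernel(x, t, eta)
    K_xx = rbf_kernel(x, x, eta)
    alpha = np.linalg.solve(K_tt, v)                      # (K_tt + sigma^2 I)^{-1} v
    mu = K_xt @ alpha                                     # equation (1)
    Sigma = K_xx - K_xt @ np.linalg.solve(K_tt, K_xt.T)   # equation (2)
    return mu, Sigma
```

This dense computation is exactly the bottleneck that Section 4 replaces with the SKI and Lanczos approximations.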
By \ufb01xing a common set of reference time points x for all time series in a data set, every time series can be transformed into a common representation in the form of a multivariate Gaussian N (z|\u00b5, \u03a3; \u03b8) with z being the random vector distributed according to the posterior GP marginalized over the time points x.1 Here we assume that the GP parameters \u03b8 are shared across the entire data set.\nIf the z values were observed, we could simply apply a black-box classi\ufb01er. A classi\ufb01er can be generally de\ufb01ned by a mapping function f (z; w) parameterized by w, associated with a loss function \u2113(f (z; w), y) where y is a label value from the output space Y. However, in our case z is a Gaussian random variable, which means \u2113(f (z; w), y) is now itself a random variable given a label y. Therefore, we use the expectation Ez\u223cN (\u00b5,\u03a3;\u03b8)[\u2113(f (z; w), y)] as the overall loss between the label y and a time series S given its Gaussian representation N (\u00b5, \u03a3; \u03b8). The learning problem becomes minimizing the expected loss over the entire data set:\n\nw\u2217, \u03b8\u2217 = argminw,\u03b8 \u2211i=1,...,N Ezi\u223cN (\u00b5i,\u03a3i;\u03b8)[\u2113(f (zi; w), yi)].   (3)\n\nOnce we have the optimal parameters w\u2217 and \u03b8\u2217, we can make predictions on unseen data. In general, given an unseen time series S and its Gaussian representation N (\u00b5, \u03a3; \u03b8\u2217), we can predict its label using (4), although in many cases this can be simpli\ufb01ed into a function of f (z; w\u2217) with the expectation taken on or inside of f (z; w\u2217).\n\ny\u2217 = argminy\u2208Y Ez\u223cN (\u00b5,\u03a3;\u03b8\u2217)[\u2113(f (z; w\u2217), y)]   (4)\n\nWe name the above approach the Uncertainty-Aware Classi\ufb01cation (UAC) framework. 
Importantly, this framework propagates the uncertainty in the GP posterior induced by each time series all the way through to the loss function. We call the transformation S \u21a6 (\u00b5, \u03a3) the Gaussian process adapter, since it provides a uniform representation to connect the raw irregularly sampled time series data to a black-box classi\ufb01er.\nVariations of the UAC framework can be derived by taking the expectation at various positions relative to f (z; w) where z \u223c N (\u00b5, \u03a3; \u03b8). Taking the expectation at an earlier stage simpli\ufb01es the computation, but the uncertainty information will be integrated out earlier as well.2 In the extreme case, if the expectation is computed immediately after the GP adapter transformation, it is equivalent to using a plug-in estimate \u00b5 for z in the loss function, \u2113(f (Ez\u223cN (\u00b5,\u03a3;\u03b8)[z]; w), y) = \u2113(f (\u00b5; w), y). We refer to this as the IMPutation (IMP) framework. The IMP framework discards the uncertainty information completely, which further simpli\ufb01es the computation. This simpli\ufb01ed variation may be useful when the time series are more densely sampled, where the uncertainty is less of a concern.\nIn practice, we can train the model using the UAC objective (3) and predict instead by IMP. In that case, the predictions would be deterministic and can be computed ef\ufb01ciently without drawing samples from the posterior GP as described later in Section 4.\n\n3.2 Learning with the GP adapter\n\nIn the previous section, we showed that the UAC framework can be trained using (3). In this paper, we use stochastic gradient descent to scalably optimize (3) by updating the model using a single time series at a time, although it can be easily modi\ufb01ed for batch or mini-batch updates. 
From now on, we will focus on the optimization problem minw,\u03b8 Ez\u223cN (\u00b5,\u03a3;\u03b8)[\u2113(f (z; w), y)] where \u00b5, \u03a3 are the output of the GP adapter given a time series S = (t, v) and its label y. For many classi\ufb01ers, the expected loss Ez\u223cN (\u00b5,\u03a3;\u03b8)[\u2113(f (z; w), y)] cannot be computed analytically. In such cases, we use the Monte Carlo average to approximate the expected loss:\n\nEz\u223cN (\u00b5,\u03a3;\u03b8)[\u2113(f (z; w), y)] \u2248 (1/S) \u2211s=1,...,S \u2113(f (zs; w), y), where zs \u223c N (\u00b5, \u03a3; \u03b8).   (5)\n\nTo learn the parameters of both the classi\ufb01er w and the Gaussian process regression model \u03b8 jointly under the expected loss, we need to be able to compute the gradient of the expectation given in (5). To achieve this, we reparameterize the Gaussian random variable using the identity z = \u00b5 + R\u03be where \u03be \u223c N (0, I) and R satis\ufb01es \u03a3 = RR\u22a4 [11]. The gradients under this reparameterization are given below, both of which can be approximated using Monte Carlo sampling as in (5).\n\n1 The notation N (\u00b5, \u03a3; \u03b8) explicitly expresses that both \u00b5 and \u03a3 are functions of the GP parameters \u03b8. Besides, they are also functions of S = (t, v) as shown in (1) and (2).\n2 For example, the loss of the expected output of the classi\ufb01er \u2113(Ez\u223cN (\u00b5,\u03a3;\u03b8)[f (z; w)], y).\n\n
We will focus on ef\ufb01ciently computing the gradient shown in (7) since we assume that the gradient of the base classi\ufb01er f (z; w) can be computed ef\ufb01ciently.\n\n(\u2202/\u2202w) Ez\u223cN (\u00b5,\u03a3;\u03b8)[\u2113(f (z; w), y)] = E\u03be\u223cN (0,I)[(\u2202/\u2202w) \u2113(f (z; w), y)]   (6)\n(\u2202/\u2202\u03b8) Ez\u223cN (\u00b5,\u03a3;\u03b8)[\u2113(f (z; w), y)] = E\u03be\u223cN (0,I)[\u2211i (\u2202\u2113(f (z; w), y)/\u2202zi)(\u2202zi/\u2202\u03b8)]   (7)\n\nThere are several choices for R that satisfy \u03a3 = RR\u22a4. One common choice of R is the Cholesky factor, a lower triangular matrix, which can be computed using Cholesky decomposition in O(d3) time for a d \u00d7 d covariance matrix \u03a3 [7]. We instead use the symmetric matrix square root R = \u03a31/2. We will show that this particular choice of R leads to an ef\ufb01cient and scalable approximation algorithm in Section 4.2.\n\n4 Fast sampling from posterior Gaussian processes\n\nThe computation required by the GP adapter is dominated by the time needed to draw samples from the marginal GP posterior using z = \u00b5 + \u03a31/2\u03be. In Section 2 we noted that the time complexity of exactly computing the posterior mean \u00b5 and covariance \u03a3 is O(n3 + nd) and O(n3 + n2d + nd2), respectively. Once we have both \u00b5 and \u03a3 we still need to compute the square root of \u03a3, which requires an additional O(d3) time to compute exactly. 
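Ignoring scalability for a moment, the exact sampler z = \u00b5 + \u03a31/2\u03be and the Monte Carlo loss estimate of (5) can be sketched as follows. This is a hedged, small-scale illustration, not the paper's implementation: the dense eigendecomposition used here for \u03a31/2 is exactly the O(d3) step that the Lanczos method below replaces.

```python
import numpy as np

def sample_z(mu, Sigma, n_samples, rng):
    # Reparameterization z = mu + R xi with R = Sigma^{1/2}, the symmetric
    # square root, computed densely via eigendecomposition (O(d^3)).
    w, U = np.linalg.eigh(Sigma)
    R = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T
    xi = rng.standard_normal((n_samples, len(mu)))
    return mu + xi @ R  # R is symmetric, so xi @ R == (R @ xi.T).T

def expected_loss(mu, Sigma, loss_fn, n_samples=10, seed=0):
    """Monte Carlo estimate of E_{z ~ N(mu, Sigma)}[loss_fn(z)], as in (5)."""
    rng = np.random.default_rng(seed)
    zs = sample_z(mu, Sigma, n_samples, rng)
    return np.mean([loss_fn(z) for z in zs])
```

With `Sigma = 0` this degenerates to the IMP plug-in loss `loss_fn(mu)`, matching the discussion in Section 3.1.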
In this section, we show how to ef\ufb01ciently generate samples of z.\n\n4.1 Structured kernel interpolation for approximating GP posterior means\n\nThe main idea of the structured kernel interpolation (SKI) framework recently proposed by Wilson and Nickisch [25] is to approximate a stationary kernel matrix Ka,b by the approximate kernel K\u0303a,b de\ufb01ned below, where u = [u1, . . . , um]\u22a4 is a collection of evenly-spaced inducing points.\n\nKa,b \u2248 K\u0303a,b = WaKu,uWb\u22a4.   (8)\n\nLetting p = |a| and q = |b|, Wa \u2208 Rp\u00d7m is a sparse interpolation matrix where each row contains only a small number of non-zero entries. We use local cubic convolution interpolation (cubic interpolation for short) [10] as suggested in Wilson and Nickisch [25]. Each row of the interpolation matrices Wa, Wb has at most four non-zero entries. Wilson and Nickisch [25] showed that when the kernel is locally smooth (under the resolution of u), cubic interpolation results in accurate approximation. This can be justi\ufb01ed as follows: with cubic interpolation, the SKI kernel is essentially the two-dimensional cubic interpolation of Ka,b using the exact regularly spaced samples stored in Ku,u, which corresponds to classical bicubic convolution. In fact, we can show that K\u0303a,b asymptotically converges to Ka,b as m increases by following the derivation in Keys [10].\nPlugging the SKI kernel into (1), the posterior GP mean evaluated at x can be approximated by\n\n\u00b5 = Kx,t(Kt,t + \u03c32I)\u22121v \u2248 WxKu,uWt\u22a4(WtKu,uWt\u22a4 + \u03c32I)\u22121v.   (9)\n\nThe inducing points u are chosen to be evenly-spaced because Ku,u forms a symmetric Toeplitz matrix under a stationary covariance function. 
A symmetric Toeplitz matrix can be embedded into a circulant matrix to perform matrix-vector multiplication using fast Fourier transforms [7]. Further, one can use the conjugate gradient method to solve for (WtKu,uWt\u22a4 + \u03c32I)\u22121v, which only involves computing the matrix-vector product (WtKu,uWt\u22a4 + \u03c32I)v. In practice, the conjugate gradient method converges within only a few iterations. Therefore, approximating the posterior mean \u00b5 using SKI takes only O(n + d + m log m) time to compute. In addition, since a symmetric Toeplitz matrix Ku,u can be uniquely characterized by its \ufb01rst column, and Wt can be stored as a sparse matrix, approximating \u00b5 requires only O(n + d + m) space.\n\nAlgorithm 1: Lanczos method for approximating \u03a31/2\u03be\nInput: covariance matrix \u03a3, dimension of the Krylov subspace k, random vector \u03be\n\u03b21 = 0 and d0 = 0\nd1 = \u03be/\u2016\u03be\u2016\nfor j = 1 to k do\n    d = \u03a3dj \u2212 \u03b2jdj\u22121\n    \u03b1j = dj\u22a4d\n    d = d \u2212 \u03b1jdj\n    \u03b2j+1 = \u2016d\u2016\n    dj+1 = d/\u03b2j+1\nD = [d1, . . . , dk]\nH = tridiagonal(\u03b2, \u03b1, \u03b2)\nreturn \u2016\u03be\u2016DH1/2e1   // e1 = [1, 0, . . . , 0]\u22a4\n\nHere tridiagonal(\u03b2, \u03b1, \u03b2) denotes the k \u00d7 k symmetric tridiagonal matrix with diagonal entries \u03b11, . . . , \u03b1k and off-diagonal entries \u03b22, . . . , \u03b2k.\n\n4.2 The Lanczos method for covariance square root-vector products\n\nWith the SKI techniques, although we can ef\ufb01ciently approximate the posterior mean \u00b5, computing \u03a31/2\u03be is still challenging. If computed exactly, it takes O(n3 + n2d + nd2) time to compute \u03a3 and O(d3) time to take the square root. 
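The circulant embedding mentioned above can be sketched as follows: a symmetric Toeplitz matrix, represented by its first column, is embedded in a circulant matrix of twice the size, whose matrix-vector product diagonalizes under the FFT. This is a minimal illustration of the O(m log m) product, not the paper's code.

```python
import numpy as np

def toeplitz_matvec(first_col, x):
    """Multiply a symmetric Toeplitz matrix (given by its first column) by x
    in O(m log m) time via a circulant embedding of size 2m and FFTs.
    """
    m = len(first_col)
    # First column of the 2m-sized circulant embedding:
    # [t_0, t_1, ..., t_{m-1}, 0, t_{m-1}, ..., t_1].
    c = np.concatenate([first_col, [0.0], first_col[-1:0:-1]])
    fx = np.fft.fft(np.concatenate([x, np.zeros(m)]))
    y = np.fft.ifft(np.fft.fft(c) * fx)
    # The first m entries of the circulant product equal the Toeplitz product.
    return y[:m].real
```

In the GP adapter, `first_col` would be the first column of Ku,u under the stationary kernel evaluated on the evenly-spaced inducing points.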
To overcome this bottleneck, we apply the SKI kernel to the Lanczos method, one of the Krylov subspace approximation methods, to speed up the computation of \u03a31/2\u03be as shown in Algorithm 1. The advantage of the Lanczos method is that neither \u03a3 nor \u03a31/2 needs to be computed explicitly. Like the conjugate gradient method, another example of the Krylov subspace methods, it only requires the computation of matrix-vector products with \u03a3 as the matrix.\nThe idea of the Lanczos method is to approximate \u03a31/2\u03be in the Krylov subspace Kk(\u03a3, \u03be) = span{\u03be, \u03a3\u03be, . . . , \u03a3k\u22121\u03be}. The iteration in Algorithm 1, usually referred to as the Lanczos process, essentially performs the Gram-Schmidt process to transform the basis {\u03be, \u03a3\u03be, . . . , \u03a3k\u22121\u03be} into an orthonormal basis {d1, . . . , dk} for the subspace Kk(\u03a3, \u03be).\nThe optimal approximation of \u03a31/2\u03be in the Krylov subspace Kk(\u03a3, \u03be) that minimizes the \u21132-norm of the error is the orthogonal projection of \u03a31/2\u03be onto Kk(\u03a3, \u03be), given by y\u2217 = DD\u22a4\u03a31/2\u03be. Since we choose d1 = \u03be/\u2016\u03be\u2016, the optimal projection can be written as y\u2217 = \u2016\u03be\u2016DD\u22a4\u03a31/2De1 where e1 = [1, 0, . . . , 0]\u22a4 is the \ufb01rst column of the identity matrix.\nOne can show that the tridiagonal matrix H de\ufb01ned in Algorithm 1 satis\ufb01es D\u22a4\u03a3D = H [20]. Also, we have D\u22a4\u03a31/2D \u2248 (D\u22a4\u03a3D)1/2 since the eigenvalues of H approximate the extremal eigenvalues of \u03a3 [19]. Therefore we have y\u2217 = \u2016\u03be\u2016DD\u22a4\u03a31/2De1 \u2248 \u2016\u03be\u2016DH1/2e1.\nThe error bound of the Lanczos method is analyzed in Ili\u0107 et al. [9]. Alternatively, one can show that the Lanczos approximation converges superlinearly [16]. 
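A minimal NumPy sketch of Algorithm 1 follows (single-vector Lanczos, no reorthogonalization, assuming no breakdown). Here `matvec` stands in for the SKI-accelerated product \u03a3d, and the small k \u00d7 k square root H1/2 is taken densely by eigendecomposition, which is the O(k3) step discussed below.

```python
import numpy as np
from numpy.linalg import eigh, norm

def lanczos_sqrt_vec(matvec, xi, k):
    """Approximate Sigma^{1/2} xi with k Lanczos iterations (Algorithm 1).

    `matvec(d)` computes Sigma @ d; Sigma itself is never formed.
    """
    n = len(xi)
    D = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k)
    D[:, 0] = xi / norm(xi)
    d_prev, b = np.zeros(n), 0.0
    for j in range(k):
        d = matvec(D[:, j]) - b * d_prev       # d = Sigma d_j - beta_j d_{j-1}
        alpha[j] = D[:, j] @ d
        d = d - alpha[j] * D[:, j]
        b = norm(d)                            # beta_{j+1} (assumed nonzero)
        beta[j] = b
        d_prev = D[:, j]
        if j + 1 < k:
            D[:, j + 1] = d / b
    # H = tridiagonal(beta, alpha, beta), a k x k symmetric matrix.
    H = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
    w, U = eigh(H)
    H_sqrt = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T
    return norm(xi) * D @ H_sqrt[:, 0]         # ||xi|| D H^{1/2} e1
```

With k = n the Krylov subspace is the full space, so the approximation recovers \u03a31/2\u03be (up to floating-point error); in practice a small constant k suffices, as the text notes.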
In practice, for a d \u00d7 d covariance matrix \u03a3, the approximation is suf\ufb01cient for our sampling purpose with k \u226a d. As H is now a k \u00d7 k matrix, we can use any standard method to compute its square root in O(k3) time [2], which is considered O(1) when k is chosen to be a small constant. Now the computation of the Lanczos method for approximating \u03a31/2\u03be is dominated by the matrix-vector product \u03a3d during the Lanczos process. Here we apply the SKI kernel trick again to ef\ufb01ciently approximate \u03a3d by\n\n\u03a3d \u2248 WxKu,uWx\u22a4d \u2212 WxKu,uWt\u22a4(WtKu,uWt\u22a4 + \u03c32I)\u22121WtKu,uWx\u22a4d.   (10)\n\nSimilar to the posterior mean, \u03a3d can be approximated in O(n + d + m log m) time and linear space. Therefore, for k = O(1) basis vectors, the entire Algorithm 1 takes O(n + d + m log m) time and O(n + d + m) space, which is also the complexity to draw a sample from the posterior GP.\nTo reduce the variance when estimating the expected loss (5), we can draw multiple samples from the posterior GP: {\u03a31/2\u03bes}s=1,...,S where \u03bes \u223c N (0, I). Since all of the samples are associated with the same covariance matrix \u03a3, we can use the block Lanczos process [8], an extension of the single-vector Lanczos method presented in Algorithm 1, to simultaneously approximate \u03a31/2\u039e for all S random vectors \u039e = [\u03be1, . . . , \u03beS]. 
Similarly, during the block Lanczos process, we use the block conjugate gradient method [6, 5] to simultaneously compute (WtKu,uWt\u22a4 + \u03c32I)\u22121\u03b1 for multiple right-hand sides \u03b1.\n\n5 End-to-end learning with the GP adapter\n\nThe most common way to train GP parameters is through maximizing the marginal likelihood [17]\n\nlog p(v|t, \u03b8) = \u2212(1/2) v\u22a4(Kt,t + \u03c32I)\u22121v \u2212 (1/2) log |Kt,t + \u03c32I| \u2212 (n/2) log 2\u03c0.   (11)\n\nIf we follow this criterion, training the UAC framework becomes a two-stage procedure: \ufb01rst we learn GP parameters by maximizing the marginal likelihood. We then compute \u00b5 and \u03a3 given each time series S and the learned GP parameters \u03b8\u2217. Both \u00b5 and \u03a3 are then \ufb01xed and used to train the classi\ufb01er using (6).\nIn this section, we describe how to instead train the GP parameters discriminatively end-to-end using backpropagation. As mentioned in Section 3, we train the UAC framework by jointly optimizing the GP parameters \u03b8 and the parameters of the classi\ufb01er w according to (6) and (7).\nThe most challenging part in (7) is to compute \u2202z = \u2202\u00b5 + \u2202(\u03a31/2\u03be).3 For \u2202\u00b5, we can derive the gradient of the approximating posterior mean (9) as given in Appendix A. Note that the gradient \u2202\u00b5 can be approximated ef\ufb01ciently by repeatedly applying fast Fourier transforms and the conjugate gradient method in the same time and space complexity as computing (9).\nOn the other hand, \u2202(\u03a31/2\u03be) can be approximated by backpropagating through the Lanczos method described in Algorithm 1. To carry out backpropagation, all operations in the Lanczos method must be differentiable. 
For the approximation of \u03a3d during the Lanczos process, we can similarly compute the gradient of (10) ef\ufb01ciently using the SKI techniques as in computing \u2202\u00b5 (see Appendix A).\nThe gradient \u2202H1/2 for the last step of Algorithm 1 can be derived as follows. From H = H1/2H1/2, we have \u2202H = (\u2202H1/2)H1/2 + H1/2(\u2202H1/2). This is known as the Sylvester equation, which has the form AX + XB = C where A, B, C are matrices and X is the unknown matrix to solve for. We can compute the gradient \u2202H1/2 by solving the Sylvester equation using the Bartels-Stewart algorithm [1] in O(k3) time for a k \u00d7 k matrix H, which is considered O(1) for a small constant k.\nOverall, training the GP adapter using stochastic optimization with the aforementioned approach takes O(n + d + m log m) time and O(n + d + m) space for m inducing points, n observations in the time series, and d features generated by the GP adapter.\n\n6 Related work\n\nThe recently proposed mixtures of expected Gaussian kernels (MEG) [13] for classi\ufb01cation of irregular time series is probably the closest work to ours. The random feature representation of the MEG kernel is in the form of \u221a(2/m) Ez\u223cN (\u00b5,\u03a3)[cos(wi\u22a4z + bi)], which the algorithm described in Section 4 can be applied to directly. However, by exploiting the spectral property of Gaussian kernels, the expected random feature of the MEG kernel is shown to be analytically computable as \u221a(2/m) exp(\u2212wi\u22a4\u03a3wi/2) cos(wi\u22a4\u00b5 + bi). With the SKI techniques, we can ef\ufb01ciently approximate both wi\u22a4\u03a3wi and wi\u22a4\u00b5 in the same time and space complexity as the GP adapter. Moreover, the random features of the MEG kernel can be viewed as a stochastic layer in the classi\ufb01cation network, with no trainable parameters. 
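The closed-form expected feature above can be sketched as follows. This is an illustration of the formula only, under the assumption that `W` holds the random frequencies wi as rows and `b` the phases bi (names are mine, not the paper's).

```python
import numpy as np

def meg_expected_features(mu, Sigma, W, b):
    """Closed-form expected random features of the MEG kernel:
    sqrt(2/m) * exp(-w_i^T Sigma w_i / 2) * cos(w_i^T mu + b_i).

    W is m x d (rows are the w_i), b has length m.
    """
    m = W.shape[0]
    quad = np.einsum('ij,jk,ik->i', W, Sigma, W)   # w_i^T Sigma w_i for each i
    return np.sqrt(2.0 / m) * np.exp(-quad / 2.0) * np.cos(W @ mu + b)
```

Note that with Sigma = 0 this reduces to the plain random Fourier features of mu, mirroring how IMP discards the uncertainty term.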
All {wi, bi}i=1,...,m are randomly initialized once in the beginning and associated with the output of the GP adapter in the nonlinear way described above.\nMoreover, the MEG kernel classi\ufb01cation is originally a two-stage method: one \ufb01rst estimates the GP parameters by maximizing the marginal likelihood and then uses the optimized GP parameters to compute the MEG kernel for classi\ufb01cation. Since the random feature is differentiable, with the approximation of \u2202\u00b5 and \u2202(\u03a3d) described in Section 5, we can form a similar classi\ufb01cation network that can be ef\ufb01ciently trained end-to-end using the GP adapter. In Section 7.2, we will show that training the MEG kernel end-to-end leads to better classi\ufb01cation performance.\n\n3 For brevity, we drop 1/\u2202\u03b8 from the gradient notation in this section.\n\nFigure 1: Left: Sample approximation error versus the number of inducing points. Middle: Sample approximation error versus the number of Lanczos iterations. Right: Running time comparisons (in seconds). BP denotes computing the gradient of the sample using backpropagation.\n\n7 Experiments\n\nIn this section, we present experiments and results exploring several facets of the GP adapter framework including the quality of the approximations and the classi\ufb01cation performance of the framework when combined with different base classi\ufb01ers.\n\n7.1 Quality of GP sampling approximations\n\nThe key to scalable learning with the GP adapter relies on both fast and accurate approximation for drawing samples from the posterior GP. To assess the approximation quality, we \ufb01rst generate a synthetic sparse and irregularly-sampled time series S by sampling from a zero-mean Gaussian process at random time points. We use the squared exponential kernel k(ti, tj) = a exp(\u2212b(ti \u2212 tj)2) with randomly chosen hyperparameters. We then infer \u00b5 and \u03a3 at some reference x given S. Let z\u0303 denote our approximation of z = \u00b5 + \u03a31/2\u03be. In this experiment, we set the output size of z to be |S|, that is, d = n. We evaluate the approximation quality by assessing the error \u2016z\u0303 \u2212 z\u2016 computed with a \ufb01xed random vector \u03be.\nThe leftmost plot in Figure 1 shows the approximation error under different numbers of inducing points m with k = 10 Lanczos iterations. The middle plot compares the approximation error as the number of Lanczos iterations k varies, with m = 256 inducing points. These two plots show that the approximation error drops as more inducing points and Lanczos iterations are used. In both plots, the three lines correspond to different sizes for z: 1000 (bottom line), 2000 (middle line), 3000 (top line). The separation between the curves is due to the fact that the errors are compared under the same number of inducing points. A longer time series leads to lower resolution of the inducing points and hence a higher approximation error.\nNote that the approximation error comes from both the cubic interpolation and the Lanczos method. Therefore, to achieve a certain normalized approximation error across different data sizes, we should simultaneously use more inducing points and Lanczos iterations as the data grows. In practice, we \ufb01nd that k \u2265 3 is suf\ufb01cient for estimating the expected loss for classi\ufb01cation.\nThe rightmost plot in Figure 1 compares the time to draw a sample using exact computation versus the approximation method described in Section 4 (exact and Lanczos in the \ufb01gure). We also compare the time to compute the gradient with respect to the GP parameters by both the exact method and the proposed approximation (exact BP and Lanczos BP in the \ufb01gure) because this is the actual computation carried out during training. 
In this part of the experiment, we use k = 10 and m = 256. The plot shows that the Lanczos approximation with the SKI kernel yields speed-ups of between 1 and 3 orders of magnitude. Interestingly, for the exact approach, the time for computing the gradient is roughly double the time of drawing samples. (Note that time is plotted in log scale.) This is because computing gradients requires both forward and backward propagation, whereas drawing samples corresponds to only the forward pass. Both the forward and backward passes take roughly the same computation in the exact case. However, the gap is relatively larger for the approximation approach due to the recursive relationship of the variables in the Lanczos process. In particular, dj is de\ufb01ned recursively in terms of all of d1, . . . , dj\u22121, which makes the backpropagation computation more complicated than the forward pass.\n\nTable 1: Comparison of classi\ufb01cation accuracy (in percent). IMP and UAC refer to the loss functions for training described in Section 3.1, and we use IMP predictions throughout. Although not belonging to the UAC framework, we put the MEG kernel in UAC since it is also uncertainty-aware.\n\n                          LogReg   MLP     ConvNet   MEG kernel\nMarginal likelihood  IMP  77.90    85.49   87.61     \u2013\n                     UAC  78.23    87.05   88.17     84.82\nEnd-to-end           IMP  79.12    86.49   89.84     \u2013\n                     UAC  79.24    87.95   91.41     86.61\n\n7.2 Classi\ufb01cation with GP adapter\n\nIn this section, we evaluate the performance of classifying sparse and irregularly-sampled time series using the UAC framework. We test the framework on the uWave data set,4 a collection of gesture samples categorized into eight gesture patterns [14]. 
The data set has been split into 3582 training instances and 896 test instances. Each time series contains 945 fully observed samples. Following the data preparation procedure in the MEG kernel work [13], we randomly sample 10% of the observations from each time series to simulate the sparse and irregular sampling scenario. In this experiment, we use the squared exponential covariance function k(ti, tj) = a exp(−b(ti − tj)²) for a, b > 0. Together with the independent noise parameter σ² > 0, the GP parameters are {a, b, σ²}. To bypass the positivity constraints on the GP parameters during gradient-based training, we reparameterize them by {α, β, γ} such that a = e^α, b = e^β, and σ² = e^γ.
To demonstrate that the GP adapter is capable of working with various classifiers, we use the UAC framework to train three different classifiers: a multi-class logistic regression (LogReg), a fully-connected feedforward network (MLP), and a convolutional neural network (ConvNet). The detailed architecture of each model is described in Appendix C.
We use m = 256 inducing points, d = 254 features output by the GP adapter, k = 5 Lanczos iterations, and S = 10 samples. We split the training set into two partitions: 70% for training and 30% for validation. We jointly train the classifier with the GP adapter using stochastic gradient descent with Nesterov momentum, and apply early stopping based on the validation set. We also compare to classification with the MEG kernel implemented using our GP adapter as described in Section 6, using 1000 random features trained with multi-class logistic regression.
Table 1 shows that among all three classifiers, training the GP parameters discriminatively always leads to better accuracy than maximizing the marginal likelihood. This claim also holds for the results using the MEG kernel.
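The exponential reparameterization of the GP hyperparameters used in these experiments can be sketched as follows. This is a minimal NumPy illustration only (the function name and interface are our own, not the paper's implementation): optimizing over the unconstrained {α, β, γ} guarantees a, b, σ² > 0 without projection steps.

```python
import numpy as np

def se_kernel(t1, t2, alpha, beta, gamma=None):
    """Squared exponential covariance k(ti, tj) = a exp(-b (ti - tj)^2),
    with positivity enforced via a = e^alpha and b = e^beta.

    If gamma is given, independent noise sigma^2 = e^gamma is added to
    the diagonal (only meaningful when t1 and t2 are the same points).
    """
    a, b = np.exp(alpha), np.exp(beta)
    # pairwise squared time differences via broadcasting
    K = a * np.exp(-b * (t1[:, None] - t2[None, :]) ** 2)
    if gamma is not None:
        K = K + np.exp(gamma) * np.eye(len(t1))
    return K
```

Because the map from (α, β, γ) to (a, b, σ²) is smooth, gradients with respect to the unconstrained parameters follow directly by the chain rule, which is what makes this parameterization convenient for end-to-end training.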
Further, taking the uncertainty into account by sampling from the posterior GP always outperforms training using only the posterior means. Finally, we can also see that the classification accuracy improves as the model gets deeper.

8 Conclusions and future work

We have presented a general framework for classifying sparse and irregularly-sampled time series and have shown how to scale up the required computations using a new approach to generating approximate samples. We have validated the approximation quality, the computational speed-ups, and the benefit of the proposed approach relative to existing baselines.
There are many promising directions for future work, including investigating more complicated covariance functions like the spectral mixture kernel [24], different classifiers including the encoder LSTM [23], and extending the framework to multi-dimensional time series and GPs with multi-dimensional index sets (e.g., for spatial data). Lastly, the GP adapter can also be applied to other problems such as dimensionality reduction by combining it with an autoencoder.

Acknowledgements

This work was supported by the National Science Foundation under Grant No. 1350522.

4 The data set UWaveGestureLibraryAll is available at http://timeseriesclassification.com.

References
[1] Richard H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C. Communications of the ACM, 15(9):820–826, 1972.
[2] Åke Björck and Sven Hammarling. A Schur method for the square root of a matrix. Linear Algebra and its Applications, 52:127–140, 1983.
[3] Edmond Chow and Yousef Saad. Preconditioned Krylov subspace methods for sampling multivariate Gaussian distributions. SIAM Journal on Scientific Computing, 36(2):A588–A608, 2014.
[4] J. S. Clark and O. N. Bjørnstad. Population time series: process variability, observation errors, missing values, lags, and hidden states.
Ecology, 85(11):3140–3150, 2004.
[5] Augustin A. Dubrulle. Retooling the method of block conjugate gradients. Electronic Transactions on Numerical Analysis, 12:216–233, 2001.
[6] Y. T. Feng, D. R. J. Owen, and D. Perić. A block conjugate gradient method applied to linear systems with multiple right-hand sides. Computer Methods in Applied Mechanics and Engineering, 1995.
[7] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[8] Gene Howard Golub and Richard Underwood. The block Lanczos method for computing eigenvalues. Mathematical Software, 3:361–377, 1977.
[9] M. Ilić, Ian W. Turner, and Daniel P. Simpson. A restarted Lanczos approximation to functions of a symmetric matrix. IMA Journal of Numerical Analysis, page drp003, 2009.
[10] Robert G. Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1153–1160, 1981.
[11] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
[12] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2004.
[13] Steven Cheng-Xian Li and Benjamin M. Marlin. Classification of sparse and irregularly sampled time series with mixtures of expected Gaussian kernels and random features. In 31st Conference on Uncertainty in Artificial Intelligence, 2015.
[14] Jiayang Liu, Lin Zhong, Jehan Wickramasuriya, and Venu Vasudevan. uWave: Accelerometer-based personalized gesture recognition and its applications. Pervasive and Mobile Computing, 2009.
[15] Benjamin M. Marlin, David C. Kale, Robinder G. Khemani, and Randall C. Wetzel.
Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 389–398, 2012.
[16] Beresford N. Parlett. The Symmetric Eigenvalue Problem, volume 7. SIAM, 1980.
[17] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[18] T. Ruf. The Lomb-Scargle periodogram in biological rhythm research: analysis of incomplete and unequally spaced time-series. Biological Rhythm Research, 30(2):178–201, 1999.
[19] Yousef Saad. On the rates of convergence of the Lanczos and the block-Lanczos methods. SIAM Journal on Numerical Analysis, 17(5):687–706, 1980.
[20] Yousef Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.
[21] Jeffrey D. Scargle. Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal, 263:835–853, 1982.
[22] M. Schulz and K. Stattegger. SPECTRUM: Spectral analysis of unevenly spaced paleoclimatic time series. Computers & Geosciences, 23(9):929–945, 1997.
[23] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[24] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[25] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP).
In Proceedings of the 32nd International Conference on Machine Learning, 2015.