{"title": "Incremental Variational Sparse Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 4410, "page_last": 4418, "abstract": "Recent work on scaling up Gaussian process regression (GPR) to large datasets has primarily focused on sparse GPR, which leverages a small set of basis functions to approximate the full Gaussian process during inference.  However, the majority of these approaches are batch methods that operate on the entire training dataset at once, precluding the use of datasets that are streaming or too large to fit into memory. Although previous work has considered incrementally solving variational sparse GPR, most algorithms fail to update the basis functions and therefore perform suboptimally. We propose a novel incremental learning algorithm for variational sparse GPR based on stochastic mirror ascent of probability densities in reproducing kernel Hilbert space. This new formulation allows our algorithm to update basis functions online in accordance with the manifold structure of probability densities for fast convergence. We conduct several experiments and show that our proposed approach achieves better empirical performance in terms of prediction error than  the recent state-of-the-art incremental solutions to variational sparse GPR.", "full_text": "Incremental Variational Sparse Gaussian Process\n\nRegression\n\nChing-An Cheng\n\nByron Boots\n\nInstitute for Robotics and Intelligent Machines\n\nInstitute for Robotics and Intelligent Machines\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\ncacheng@gatech.edu\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nbboots@cc.gatech.edu\n\nAbstract\n\nRecent work on scaling up Gaussian process regression (GPR) to large datasets has\nprimarily focused on sparse GPR, which leverages a small set of basis functions\nto approximate the full Gaussian process during inference. 
However, the majority of these approaches are batch methods that operate on the entire training dataset at once, precluding the use of datasets that are streaming or too large to fit into memory. Although previous work has considered incrementally solving variational sparse GPR, most algorithms fail to update the basis functions and therefore perform suboptimally. We propose a novel incremental learning algorithm for variational sparse GPR based on stochastic mirror ascent of probability densities in reproducing kernel Hilbert space. This new formulation allows our algorithm to update basis functions online in accordance with the manifold structure of probability densities for fast convergence. We conduct several experiments and show that our proposed approach achieves better empirical performance in terms of prediction error than the recent state-of-the-art incremental solutions to variational sparse GPR.\n\n1 Introduction\nGaussian processes (GPs) are nonparametric statistical models widely used for probabilistic reasoning about functions. Gaussian process regression (GPR) can be used to infer the distribution of a latent function $f$ from data. The merit of GPR is that it finds the maximum a posteriori estimate of the function while providing the profile of the remaining uncertainty. However, GPR also has drawbacks: like most nonparametric learning techniques, the time and space complexity of GPR scale polynomially with the amount of training data. Given $N$ observations, inference of GPR involves inverting an $N \\times N$ covariance matrix, which requires $O(N^3)$ operations and $O(N^2)$ storage. Therefore, GPR for large $N$ is infeasible in practice.\nSparse Gaussian process regression is a pragmatic solution that trades accuracy against computational complexity. 
Instead of parameterizing the posterior using all $N$ observations, the idea is to approximate the full GP using the statistics of finite $M \\ll N$ function values and leverage the induced low-rank structure to reduce the complexity to $O(M^2 N + M^3)$ and the memory to $O(M^2)$. Often sparse GPRs are expressed in terms of the distribution of $f(\\tilde{x}_i)$, where $\\tilde{X} = \\{\\tilde{x}_i \\in \\mathcal{X}\\}_{i=1}^M$ are called inducing points or pseudo-inputs [21, 23, 18, 26]. A more general representation leverages the information about the inducing function $(L_i f)(\\tilde{x}_i)$ defined by indirect measurement of $f$ through a bounded linear operator $L_i$ (e.g. integral) to more compactly capture the full GP [27, 8]. In this work, we embrace the general notion of inducing functions, which trivially includes $f(\\tilde{x}_i)$ by choosing $L_i$ to be identity. With abuse of notation, we reuse the term inducing points $\\tilde{X}$ to denote the parameters that define the inducing functions.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nLearning a sparse GP representation in regression can be summarized as inference of the hyperparameters, the inducing points, and the statistics of inducing functions. One approach to learning is to treat all of the parameters as hyperparameters and find the solution that maximizes the marginal likelihood [21, 23, 18]. An alternative approach is to view the inducing points and the statistics of inducing functions as variational parameters of a class of full GPs, to approximate the true posterior of $f$, and solve the problem via variational inference, which has been shown robust to over-fitting [26, 1].\nAll of the above methods are designed for the batch setting, where all of the data is collected in advance and used at once. 
However, if the training dataset is extremely large or the data are streaming and encountered in sequence, we may want to incrementally update the approximate posterior of the latent function $f$. Early work by Csat\u00f3 and Opper [6] proposed an online version of GPR, which greedily performs moment matching of the true posterior given one sample instead of the posterior of all samples. More recently, several attempts have been made to modify variational batch algorithms to incremental algorithms for learning sparse GPs [1, 9, 10]. Most of these methods rely on the fact that variational sparse GPR with fixed inducing points and hyperparameters is equivalent to inference of the conjugate exponential family: Hensman et al. [9] propose a stochastic approximation of the variational sparse GPR problem [26] based on stochastic natural gradient ascent [11]; Hoang et al. [10] generalize this approach to the case with general Gaussian process priors. Unlike the original variational algorithm for sparse GPR [26], which finds the optimal inducing points and hyperparameters, these algorithms only update the statistics of the inducing functions $f_{\\tilde{X}}$.\nIn this paper, we propose an incremental learning algorithm for variational sparse GPR, which we denote as iVSGPR. Leveraging the dual formulation of variational sparse GPR in reproducing kernel Hilbert space (RKHS), iVSGPR performs stochastic mirror ascent in the space of probability densities [17] to update the approximate posterior of $f$, and stochastic gradient ascent to update the hyperparameters. Stochastic mirror ascent, similar to stochastic natural gradient ascent, considers the manifold structure of probability functions and therefore converges faster than the naive gradient approach. 
In each iteration, iVSGPR solves a variational sparse GPR problem of the size of a minibatch. As a result, iVSGPR has constant complexity per iteration and can learn all the hyperparameters, the inducing points, and the associated statistics online.\n\n2 Background\nIn this section, we provide a brief summary of Gaussian process regression and sparse Gaussian process regression for efficient inference before proceeding to introduce our incremental algorithm for variational sparse Gaussian process regression in Section 3.\n\n2.1 Gaussian Processes Regression\nLet $\\mathcal{F}$ be a family of real-valued continuous functions $f : \\mathcal{X} \\mapsto \\mathbb{R}$. A GP is a distribution of functions $f$ in $\\mathcal{F}$ such that, for any finite set $X \\subset \\mathcal{X}$, $\\{f(x) | x \\in X\\}$ is Gaussian distributed $\\mathcal{N}(f(x) | m(x), k(x, x'))$: for any $x, x' \\in \\mathcal{X}$, $m(x)$ and $k(x, x')$ represent the mean of $f(x)$ and the covariance between $f(x)$ and $f(x')$, respectively. In shorthand, we write $f \\sim \\mathcal{GP}(m, k)$.\nThe mean $m(x)$ and the covariance $k(x, x')$ (the kernel function) are often parametrized by a set of hyperparameters which encode our prior belief of the unknown function $f$. In this work, for simplicity, we assume that $m(x) = 0$ and the kernel can be parameterized as $k(x, x') = \\rho^2 g_s(x, x')$, where $g_s(x, x')$ is a positive definite kernel, $\\rho^2$ is a scaling factor and $s$ denotes other hyperparameters [20].\nThe objective of GPR is to infer the posterior probability of the function $f$ given data $D = \\{(x_i, y_i)\\}_{i=1}^N$. In learning, the function value $f(x_i)$ is treated as a latent variable and the observation $y_i = f(x_i) + \\epsilon_i$ is modeled as the function corrupted by i.i.d. noise $\\epsilon_i \\sim \\mathcal{N}(\\epsilon | 0, \\sigma^2)$. Let $X = \\{x_i\\}_{i=1}^N$. 
The posterior probability distribution $p(f|y)$ can be compactly summarized as $\\mathcal{GP}(m_{|D}, k_{|D})$:\n\n$$ m_{|D}(x) = k_{x,X} (K_X + \\sigma^2 I)^{-1} y \\tag{1} $$\n$$ k_{|D}(x, x') = k_{x,x'} - k_{x,X} (K_X + \\sigma^2 I)^{-1} k_{X,x'} \\tag{2} $$\n\nwhere $y = (y_i)_{i=1}^N \\in \\mathbb{R}^N$, $k_{x,X} \\in \\mathbb{R}^{1 \\times N}$ denotes the vector of the cross-covariance between $x$ and $X$, and $K_X \\in \\mathbb{R}^{N \\times N}$ denotes the empirical covariance matrix of the training set. The hyperparameters $\\theta := (s, \\rho, \\sigma)$ in the GP are learned by maximizing the log-likelihood of the observation $y$\n\n$$ \\max_\\theta \\log p(y) = \\max_\\theta \\log \\mathcal{N}(y | 0, K_X + \\sigma^2 I). \\tag{3} $$\n\n2.2 Sparse Gaussian Processes Regression\nA straightforward approach to sparse GPR is to approximate the GP prior of interest with a degenerate GP [21, 23, 18]. Formally, for any $x_i, x_j \\in \\mathcal{X}$, it assumes that\n\n$$ f(x_i) \\perp f(x_j) | f_{\\tilde{X}}, \\qquad f(x_i) \\perp y_i | f_{\\tilde{X}}, \\tag{4} $$\n\nwhere $f_{\\tilde{X}}$ denotes $((L_i f)(\\tilde{x}_i))_{i=1}^M$ and $\\perp$ denotes probabilistic independence between two random variables. That is, the original empirical covariance matrix $K_X$ is replaced by a rank-$M$ approximation $\\hat{K}_X := K_{X,\\tilde{X}} K_{\\tilde{X}}^{-1} K_{\\tilde{X},X}$, where $K_{\\tilde{X}}$ is the covariance of $f_{\\tilde{X}}$ and $K_{X,\\tilde{X}} \\in \\mathbb{R}^{N \\times M}$ is the cross-covariance between $f_X$ and $f_{\\tilde{X}}$. Let $\\Lambda \\in \\mathbb{R}^{N \\times N}$ be diagonal. The inducing points $\\tilde{X}$ are treated as hyperparameters and can be found by jointly maximizing the log-likelihood with $\\theta$\n\n$$ \\max_{\\theta, \\tilde{X}} \\log \\mathcal{N}(y | 0, \\hat{K}_X + \\sigma^2 I + \\Lambda). \\tag{5} $$\n\nSeveral approaches to sparse GPR can be viewed as special cases of this problem [18]: the Deterministic Training Conditional (DTC) [21] approximation sets $\\Lambda$ as zero. To heal the degeneracy in $p(f_X)$, the Fully Independent Training Conditional (FITC) approximation [23] includes heteroscedastic noise, setting $\\Lambda = \\mathrm{diag}(K_X - \\hat{K}_X)$. 
As a result, their sum $\\Lambda + \\hat{K}_X$ matches the true covariance $K_X$ in the diagonal term. This general maximum likelihood scheme for finding the inducing points is adopted with variations in [24, 27, 8, 2]. A major drawback of all of these approaches is that they can over-fit due to the high degrees-of-freedom $\\tilde{X}$ in the prior parametrization [26].\nVariational sparse GPR can alternatively be formulated to approximate the posterior of the latent function by a full GP parameterized by the inducing points and the statistics of inducing functions [1, 26]. Specifically, Titsias [26] proposes to use\n\n$$ q(f_X, f_{\\tilde{X}}) = p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}}) \\tag{6} $$\n\nto approximate $p(f_X, f_{\\tilde{X}} | y)$, where $q(f_{\\tilde{X}}) = \\mathcal{N}(f_{\\tilde{X}} | \\tilde{m}, \\tilde{S})$ is the Gaussian approximation of $p(f_{\\tilde{X}} | y)$ and $p(f_X | f_{\\tilde{X}}) = \\mathcal{N}(f_X | K_{X,\\tilde{X}} K_{\\tilde{X}}^{-1} f_{\\tilde{X}}, K_X - \\hat{K}_X)$ is the conditional probability in the full GP. The novelty here is that $q(f_X, f_{\\tilde{X}})$, despite parametrization by finite parameters, is still a full GP, which, unlike its predecessor [21], can be infinite-dimensional.\nThe inference problem of variational sparse GPR is solved by minimizing the KL-divergence $\\mathrm{KL}[q(f_X, f_{\\tilde{X}}) || p(f_X, f_{\\tilde{X}} | y)]$. 
In practice, the minimization problem is transformed into the maximization of the lower bound of the log-likelihood [26]:\n\n$$ \\max_\\theta \\log p(y) \\geq \\max_{\\theta, \\tilde{X}, \\tilde{m}, \\tilde{S}} \\int q(f_X, f_{\\tilde{X}}) \\log \\frac{p(y | f_X) p(f_X | f_{\\tilde{X}}) p(f_{\\tilde{X}})}{q(f_X, f_{\\tilde{X}})} df_X df_{\\tilde{X}} $$\n$$ = \\max_{\\theta, \\tilde{X}, \\tilde{m}, \\tilde{S}} \\int p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}}) \\log \\frac{p(y | f_X) p(f_{\\tilde{X}})}{q(f_{\\tilde{X}})} df_X df_{\\tilde{X}} $$\n$$ = \\max_{\\theta, \\tilde{X}} \\log \\mathcal{N}(y | 0, \\hat{K}_X + \\sigma^2 I) - \\frac{1}{2\\sigma^2} \\mathrm{Tr}(K_X - \\hat{K}_X). \\tag{7} $$\n\nThe last equality results from exact maximization over $\\tilde{m}$ and $\\tilde{S}$; for treatment of non-conjugate likelihoods, see [22]. We note that $q(f_{\\tilde{X}})$ is a function of $\\tilde{m}$ and $\\tilde{S}$, whereas $p(f_{\\tilde{X}})$ and $p(f_X | f_{\\tilde{X}})$ are functions of $\\tilde{X}$. As a result, $\\tilde{X}$ become variational parameters that can be optimized without over-fitting. Compared with (5), the variational approach in (7) regularizes the learning with penalty $\\mathrm{Tr}(K_X - \\hat{K}_X)$ and therefore exhibits better generalization performance. Several subsequent works employ similar strategies: Alvarez et al. [3] adopt the same variational approach in the multi-output regression setting with scaled basis functions, and Abdel-Gawad et al. [1] use expectation propagation to solve for the approximate posterior under the same factorization.\n\n3 Incremental Variational Sparse Gaussian Process Regression\nDespite leveraging sparsity, the batch solution to the variational objective in (7) requires $O(M^2 N)$ operations and access to all of the training data during each optimization step [26], which means that learning from large datasets is still infeasible. Recently, several attempts have been made to incrementally solve the variational sparse GPR problem in order to learn better models from large datasets [1, 9, 10]. 
The key idea is to rewrite (7) explicitly into the sum of individual observations:\n\n$$ \\max_{\\theta, \\tilde{X}, \\tilde{m}, \\tilde{S}} \\int p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}}) \\log \\frac{p(y | f_X) p(f_{\\tilde{X}})}{q(f_{\\tilde{X}})} df_X df_{\\tilde{X}} $$\n$$ = \\max_{\\theta, \\tilde{X}, \\tilde{m}, \\tilde{S}} \\int q(f_{\\tilde{X}}) \\left( \\sum_{i=1}^N E_{p(f_{x_i} | f_{\\tilde{X}})}[\\log p(y_i | f_{x_i})] + \\log \\frac{p(f_{\\tilde{X}})}{q(f_{\\tilde{X}})} \\right) df_{\\tilde{X}}. \\tag{8} $$\n\nThe objective function in (8), with fixed $\\tilde{X}$, is identical to the problem of stochastic variational inference [11] of conjugate exponential families. Hensman et al. [9] exploit this idea to incrementally update the statistics $\\tilde{m}$ and $\\tilde{S}$ via stochastic natural gradient ascent,$^1$ which, at the $t$th iteration, takes the direction derived from the limit of maximizing (8) subject to $\\mathrm{KL}_{sym}(q_t(f_{\\tilde{X}}) || q_{t-1}(f_{\\tilde{X}})) < \\epsilon$ as $\\epsilon \\to 0$. Natural gradient ascent considers the manifold structure of probability distribution derived from KL divergence and is known to be Fisher efficient [4]. Although the optimal inducing points $\\tilde{X}$, like the statistics $\\tilde{m}$ and $\\tilde{S}$, should be updated given new observations, it is difficult to design natural gradient ascent for learning the inducing points $\\tilde{X}$ online. Because $p(f_X | f_{\\tilde{X}})$ in (8) depends on all the observations, evaluating the divergence with respect to $p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}})$ over iterations becomes infeasible.\nWe propose a novel approach to incremental variational sparse GPR, iVSGPR, that works by reformulating (7) in its RKHS dual form. This avoids the issue of the posterior approximation $p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}})$ referring to all observations. As a result, we can perform stochastic approximation of (7) while monitoring the KL divergence between the posterior approximates due to the change of $\\tilde{m}$, $\\tilde{S}$, and $\\tilde{X}$ across iterations. 
Specifically, we use stochastic mirror ascent [17] in the space of probability densities in RKHS, which was recently proven to be as efficient as stochastic natural gradient ascent [19]. In each iteration, iVSGPR solves a subproblem of fractional Bayesian inference, which we show can be formulated into a standard variational sparse GPR of the size of a minibatch in $O(M^2 N_m + M^3)$ operations, where $N_m$ is the size of a minibatch.\n\n3.1 Dual Representations of Gaussian Processes in RKHS\nAn RKHS $\\mathcal{H}$ is a Hilbert space of functions satisfying the reproducing property: $\\exists k_x \\in \\mathcal{H}$ such that $\\forall f \\in \\mathcal{H}$, $f(x) = \\langle f, k_x \\rangle_{\\mathcal{H}}$. In general, $\\mathcal{H}$ can be infinite-dimensional and uniformly approximate continuous functions on a compact set [16]. To simplify the notation we write $k_x^T f$ for $\\langle f, k_x \\rangle_{\\mathcal{H}}$, and $f^T L g$ for $\\langle f, L g \\rangle$, where $f, g \\in \\mathcal{H}$ and $L : \\mathcal{H} \\mapsto \\mathcal{H}$, even if $\\mathcal{H}$ is infinite-dimensional.\nA Gaussian process $\\mathcal{GP}(m, k)$ has a dual representation in an RKHS $\\mathcal{H}$ [12]: there exists $\\mu \\in \\mathcal{H}$ and a positive semi-definite linear operator $\\Sigma : \\mathcal{H} \\mapsto \\mathcal{H}$ such that for any $x, x' \\in \\mathcal{X}$, $\\exists \\phi_x, \\phi_{x'} \\in \\mathcal{H}$,\n\n$$ m(x) = \\rho \\phi_x^T \\mu, \\qquad k(x, x') = \\rho^2 \\phi_x^T \\Sigma \\phi_{x'}. \\tag{9} $$\n\nThat is, the mean function has a realization $\\mu$ in $\\mathcal{H}$, which is defined by the reproducing kernel $\\tilde{k}(x, x') = \\rho^2 \\phi_x^T \\phi_{x'}$; the covariance function can be equivalently represented by a linear operator $\\Sigma$. In shorthand, with abuse of notation, we write $\\mathcal{N}(f | \\mu, \\Sigma)$.$^2$ Note that we do not assume the samples from $\\mathcal{GP}(m, k)$ are in $\\mathcal{H}$. In the following, without loss of generality, we assume the GP prior considered in regression has $\\mu = 0$ and $\\Sigma = I$. 
That is, $m(x) = 0$ and $k(x, x') = \\rho^2 \\phi_x^T \\phi_{x'}$.\n\n$^1$ Although $\\tilde{X}$ was fixed in their experiments, it can potentially be updated by stochastic gradient ascent.\n$^2$ Because a GP can be infinite-dimensional, it cannot define a density but only a Gaussian measure. The notation $\\mathcal{N}(f | \\mu, \\Sigma)$ is used to indicate that the Gaussian measure can be defined, equivalently, by $\\mu$ and $\\Sigma$.\n\n3.1.1 Subspace Parametrization of the Approximate Posterior\nThe full GP posterior approximation $p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}})$ in (7) can be written equivalently in a subspace parametrization using $\\{\\psi_{\\tilde{x}_i} \\in \\mathcal{H} | \\tilde{x}_i \\in \\tilde{X}\\}_{i=1}^M$:\n\n$$ \\tilde{\\mu} = \\Psi_{\\tilde{X}} a, \\qquad \\tilde{\\Sigma} = I + \\Psi_{\\tilde{X}} A \\Psi_{\\tilde{X}}^T, \\tag{10} $$\n\nwhere $a \\in \\mathbb{R}^M$, $A \\in \\mathbb{R}^{M \\times M}$ such that $\\tilde{\\Sigma} \\succeq 0$, and $\\Psi_{\\tilde{X}} : \\mathbb{R}^M \\mapsto \\mathcal{H}$ is defined as $\\Psi_{\\tilde{X}} a = \\sum_{i=1}^M a_i \\psi_{\\tilde{x}_i}$. Suppose $q(f_{\\tilde{X}}) = \\mathcal{N}(f_{\\tilde{X}} | \\tilde{m}, \\tilde{S})$ and define $\\psi_{\\tilde{x}_i}$ to satisfy $\\Psi_{\\tilde{X}}^T \\tilde{\\mu} = \\tilde{m}$. By (10), $\\tilde{m} = K_{\\tilde{X}} a$ and $\\tilde{S} = K_{\\tilde{X}} + K_{\\tilde{X}} A K_{\\tilde{X}}$, which implies the relationship\n\n$$ a = K_{\\tilde{X}}^{-1} \\tilde{m}, \\qquad A = K_{\\tilde{X}}^{-1} \\tilde{S} K_{\\tilde{X}}^{-1} - K_{\\tilde{X}}^{-1}, \\tag{11} $$\n\nwhere the covariances related to the inducing functions are defined as $K_{\\tilde{X}} = \\Psi_{\\tilde{X}}^T \\Psi_{\\tilde{X}}$ and $K_{X,\\tilde{X}} = \\rho \\Phi_X^T \\Psi_{\\tilde{X}}$. The sparse structure results in $f(x) \\sim \\mathcal{GP}(k_{x,\\tilde{X}} K_{\\tilde{X}}^{-1} \\tilde{m}, \\; k_{x,x} + k_{x,\\tilde{X}} (K_{\\tilde{X}}^{-1} \\tilde{S} K_{\\tilde{X}}^{-1} - K_{\\tilde{X}}^{-1}) k_{\\tilde{X},x})$, which is the same as $\\int p(f(x) | f_{\\tilde{X}}) q(f_{\\tilde{X}}) df_{\\tilde{X}}$, the posterior GP found in (7), where $k_{x,x} = k(x, x)$ and $k_{x,\\tilde{X}} = \\rho \\phi_x^T \\Psi_{\\tilde{X}}$. We note that the scaling factor $\\rho$ is associated with the evaluation of $f(x)$, not the inducing functions $f_{\\tilde{X}}$. In addition, we distinguish the hyperparameter $s$ (e.g. length scale) that controls the measurement basis $\\phi_x$ from the parameters in inducing points $\\tilde{X}$.\nA subspace parametrization corresponds to a full GP if $\\tilde{\\Sigma} \\succ 0$. More precisely, because (10) is completely determined by the statistics $\\tilde{m}$, $\\tilde{S}$, and the inducing points $\\tilde{X}$, the family of subspace parametrized GPs lie on a nonlinear submanifold in the space of all GPs (the degenerate GP in (4) is a special case if we allow $I$ in (10) to be ignored).\n\n3.1.2 Sparse Gaussian Processes Regression in RKHS\nWe now reformulate the variational inference problem (7) in RKHS.$^3$ Following the previous section, the sparse GP structure on the posterior approximate $q(f_X, f_{\\tilde{X}})$ in (6) has a corresponding dual representation in RKHS $q(f) = \\mathcal{N}(f | \\tilde{\\mu}, \\tilde{\\Sigma})$. Specifically, $q(f)$ and $q(f_X, f_{\\tilde{X}})$ are related as follows:\n\n$$ q(f) \\propto p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}}) |K_{\\tilde{X}}|^{1/2} |K_X - \\hat{K}_X|^{1/2}, \\tag{12} $$\n\nin which the determinant is due to the change of measure. The equality (12) allows us to rewrite (7) in terms of $q(f)$ simply as\n\n$$ \\max_{q(f)} L(q(f)) = \\max_{q(f)} \\int q(f) \\log \\frac{p(y | f) p(f)}{q(f)} df, \\tag{13} $$\n\nor equivalently as $\\min_{q(f)} \\mathrm{KL}[q(f) || p(f | y)]$. That is, the heuristically motivated variational problem (7) is indeed minimizing a proper KL-divergence between two Gaussian measures. A similar justification on (7) is given rigorously in [14] in terms of KL-divergence minimization between Gaussian processes, which can be viewed as a dual of our approach. Due to space limitations, the proofs of (12) and the equivalence between (7) and (13) can be found in the Appendix.\nThe benefit of the formulation of (13) is that in its sampling form,\n\n$$ \\max_{q(f)} \\int q(f) \\left( \\sum_{i=1}^N \\log p(y_i | f) + \\log \\frac{p(f)}{q(f)} \\right) df, \\tag{14} $$\n\nthe approximate posterior $q(f)$ nicely summarizes all the variational parameters $\\tilde{X}$, $\\tilde{m}$, and $\\tilde{S}$ without referring to the samples as in $p(f_X | f_{\\tilde{X}}) q(f_{\\tilde{X}})$. Therefore, the KL-divergence of $q(f)$ across iterations can be used to regulate online learning.\n\n$^3$ Here we assume the set $\\mathcal{X}$ is finite and countable. This assumption suffices in learning and allows us to restrict $\\mathcal{H}$ to be the finite-dimensional span of $\\Phi_{\\mathcal{X}}$. Rigorously, for infinite-dimensional $\\mathcal{H}$, the equivalence can be written in terms of Radon\u2013Nikodym derivative between $q(f)df$ and normal Gaussian measure, where $q(f)df$ denotes a Gaussian measure that has an RKHS representation given as $q(f)$.\n\n3.2 Incremental Learning\nStochastic mirror ascent [17] considers (non-)Euclidean structure on variables induced by a Bregman divergence (or prox-function) [5] in convex optimization. We apply it to solve the variational inference problem in (14), because (14) is convex in the space of probabilities [17]. Here, we ignore the dependency of $q(f)$ on $f$ for simplicity. At the $t$th iteration, stochastic mirror ascent solves the subproblem\n\n$$ q_{t+1} = \\arg\\max_q \\gamma_t \\int \\hat{\\partial} L(q_t, y_t) q(f) df - \\mathrm{KL}[q || q_t], \\tag{15} $$\n\nwhere $\\gamma_t$ is the step size, $\\hat{\\partial} L(q_t, y_t)$ is the sampled subgradient of $L$ with respect to $q$ when the observation is $(x_t, y_t)$. 
The algorithm converges in $O(t^{-1/2})$ if (15) is solved within numerical error $\\epsilon_t$ such that $\\sum \\epsilon_t \\sim O(\\sum \\gamma_t^2)$ [7].\nThe subproblem (15) is actually equivalent to sparse variational GP regression with a general Gaussian prior. By definition of $L(q)$ in (14), (15) can be derived as\n\n$$ q_{t+1} = \\arg\\max_q \\gamma_t \\int q(f) \\left( N \\log p(y_t | f) + \\log \\frac{p(f)}{q_t(f)} \\right) df - \\mathrm{KL}[q || q_t] $$\n$$ = \\arg\\max_q \\int q(f) \\log \\frac{p(y_t | f)^{N \\gamma_t} p(f)^{\\gamma_t} q_t(f)^{1 - \\gamma_t}}{q(f)} df. \\tag{16} $$\n\nThis equation is equivalent to (13) but with the prior modified to $p(f)^{\\gamma_t} q_t(f)^{1 - \\gamma_t}$ and the likelihood modified to $p(y_t | f)^{N \\gamma_t}$. Because $p(f)$ is an isotropic zero-mean Gaussian, $p(f)^{\\gamma_t} q_t(f)^{1 - \\gamma_t}$ has the subspace parametrization expressed in the same basis functions as $q_t$. Suppose $q_t$ has mean $\\tilde{\\mu}_t$ and precision $\\tilde{\\Sigma}_t^{-1}$. Then $p(f)^{\\gamma_t} q_t(f)^{1 - \\gamma_t}$ is equivalent to $\\mathcal{N}(f | \\hat{\\mu}_t, \\hat{\\Sigma}_t)$ up to a constant factor, where $\\hat{\\mu}_t = (1 - \\gamma_t) \\hat{\\Sigma}_t \\tilde{\\Sigma}_t^{-1} \\tilde{\\mu}_t$ and $\\hat{\\Sigma}_t^{-1} = (1 - \\gamma_t) \\tilde{\\Sigma}_t^{-1} + \\gamma_t I$. By (10), $\\tilde{\\Sigma}_t^{-1} = I - \\Psi_{\\tilde{X}} (A_t^{-1} + K_{\\tilde{X}})^{-1} \\Psi_{\\tilde{X}}^T$ for some $A_t$, and therefore $\\hat{\\Sigma}_t^{-1} = I - (1 - \\gamma_t) \\Psi_{\\tilde{X}} (A_t^{-1} + K_{\\tilde{X}})^{-1} \\Psi_{\\tilde{X}}^T$, which is expressed in the same basis. In addition, by (12), (16) can be further written in the form of (7) and therefore solved by a standard sparse variational GPR program with modified $\\tilde{m}$ and $\\tilde{S}$ (please see the Appendix for details).\nAlthough we derived the equations for a single observation, minibatches can be used with the same convergence rate and reduced variance by changing the factor $p(y_t | f)^{N \\gamma_t}$ to $\\prod_{i=1}^{N_m} p(y_{t_i} | f)^{N \\gamma_t / N_m}$. The hyperparameters $\\theta = (s, \\rho, \\sigma)$ in the GP can be updated along with the variational parameters using stochastic gradient ascent along the gradient of $\\int q_t(f) \\log \\frac{p(y_t | f) p(f)}{q_t(f)} df$.\n\n3.3 Related Work\nThe subproblem (16) is equivalent to first performing stochastic natural gradient ascent [11] of $q(f)$ in (14) and then projecting the distribution back to the low-dimensional manifold specified by the subspace parametrization. At the $t$th iteration, define $q'_t(f) := p(y_t | f)^{N \\gamma_t} p(f)^{\\gamma_t} q_t(f)^{1 - \\gamma_t}$. Because a GP can be viewed as Gaussian measure in an infinite-dimensional RKHS, $q'_t(f)$ in (16) can be viewed as the result of taking natural stochastic gradient ascent with step size $\\gamma_t$ from $q_t(f)$. Then (16) becomes $\\min_q \\mathrm{KL}[q || q'_t]$ in order to project $q'_t$ back to the subspace parametrization specified by $M$ basis functions. Therefore, (16) can also be viewed as performing stochastic natural gradient ascent with a KL divergence projection. From this perspective, we can see that if $\\tilde{X}$, which controls the inducing functions, is fixed in the subproblem (16), iVSGPR degenerates to the algorithm of Hensman et al. [9].\nRecently, several studies have considered the manifold structure induced by KL divergence in Bayesian inference [7, 25, 13]. Theis and Hoffman [25] use trust regions to mitigate the sensitivity of stochastic variational inference to choices of hyperparameters and initialization. Let $\\xi_t$ be the size of the trust region. 
At the $t$th iteration, their method solves the objective $\\max_q L(q) - \\xi_t \\mathrm{KL}[q || q_t]$, which is the same as subproblem (16) if applied to (14). The difference is that in (16) $\\gamma_t$ is a decaying step sequence in stochastic mirror ascent, whereas $\\xi_t$ is manually selected. A similar formulation also appears in [13], where the part of $L(q)$ non-convex to the variational parameters is linearized. Dai et al. [7] use particles or a kernel density estimator to approximate the posterior of $\\tilde{X}$ in the setting with low-rank GP prior. By contrast, we follow Titsias\u2019s variational approach [26] to adopt a full GP as the approximate posterior, and therefore avoid the difficulties in estimating the posterior of $\\tilde{X}$ and focus on the approximate posterior $q(f)$ related to the function values.\nThe stochastic mirror ascent framework sheds light on the convergence condition of the algorithm. As pointed out in Dai et al. [7], the subproblem (15) can be solved up to $\\epsilon_t$ accuracy as long as $\\sum \\epsilon_t$ is order $O(\\sum \\gamma_t^2)$, where $\\gamma_t \\sim O(1/\\sqrt{t})$ [17]. Also, Khan et al. [13] solve a linearized approximation of (15) in each step and report satisfactory empirical results. Although variational sparse GPR (16) is a nonconvex optimization in $\\tilde{X}$ and is often solved by nonlinear conjugate gradient ascent, empirically the objective function increases most significantly in the first few iterations. Therefore, based on the results in [7], we argue that in online learning (16) can be solved approximately by performing a small fixed number of line searches.\n\n4 Experiments\nWe compare our method iVSGPR with VSGPRsvi, the state-of-the-art variational sparse GPR method based on stochastic variational inference [9], in which i.i.d. data are sampled from the training dataset to update the models. 
We consider a zero-mean GP prior generated by a squared-exponential with automatic relevance determination (SE-ARD) kernel [20] $k(x, x') = \\rho^2 \\prod_{d=1}^D \\exp\\left( \\frac{-(x_d - x'_d)^2}{2 s_d^2} \\right)$, where $s_d > 0$ is the length scale of dimension $d$ and $D$ is the dimensionality of the input. For the inducing functions, we modified the multi-scale kernel in [27] to\n\n$$ \\psi_x^T \\psi_{x'} = \\prod_{d=1}^D \\left( \\frac{2 l_{x,d} l_{x',d}}{l_{x,d}^2 + l_{x',d}^2} \\right)^{1/2} \\exp\\left( -\\sum_{d=1}^D \\frac{\\| x_d - x'_d \\|^2}{l_{x,d}^2 + l_{x',d}^2} \\right), \\tag{17} $$\n\nwhere $l_{x,d}$ is the length-scale parameter. The definition (17) includes the SE-ARD kernel as a special case, which can be recovered by identifying $\\psi_x = \\phi_x$ and $(l_{x,d})_{d=1}^D = (s_d)_{d=1}^D$, and hence their cross covariance can be computed.\nIn the following experiments, we set the number of inducing functions to 512. All models were initialized with the same hyperparameters and inducing points: the hyperparameters were selected as the optimal ones in the batch variational sparse GPR [26] trained on a subset of the training dataset of size 2048; the inducing points were initialized as random samples from the first minibatch. We chose the learning rate to be $\\gamma_t = (1 + \\sqrt{t})^{-1}$ for stochastic mirror ascent to update the posterior approximation; the learning rate for the stochastic gradient ascent to update the hyperparameters is set to $10^{-4} \\gamma_t$. 
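
As an illustration only, the kernels and step sizes described above can be sketched in plain Python. The function names (se_ard, multiscale, step_size, hyper_step_size) and the sample inputs are hypothetical and are not taken from the authors' implementation; the sketch simply evaluates the SE-ARD kernel, the multi-scale kernel (17), and the two learning rates, and checks that (17) reduces to the SE-ARD kernel (up to the factor rho^2) when both points share the length scales s_d.

```python
import math


def se_ard(x, y, s, rho=1.0):
    # SE-ARD kernel: rho^2 * prod_d exp(-(x_d - y_d)^2 / (2 * s_d^2)).
    expo = sum((a - b) ** 2 / (2.0 * sd ** 2) for a, b, sd in zip(x, y, s))
    return rho ** 2 * math.exp(-expo)


def multiscale(x, y, lx, ly):
    # Multi-scale kernel of (17): each point carries its own length scales.
    pref = 1.0
    expo = 0.0
    for a, b, la, lb in zip(x, y, lx, ly):
        denom = la ** 2 + lb ** 2
        pref *= math.sqrt(2.0 * la * lb / denom)
        expo += (a - b) ** 2 / denom
    return pref * math.exp(-expo)


def step_size(t):
    # Mirror-ascent step size gamma_t = 1 / (1 + sqrt(t)).
    return 1.0 / (1.0 + math.sqrt(t))


def hyper_step_size(t, scale=1e-4):
    # Hyperparameter learning rate, 1e-4 * gamma_t.
    return scale * step_size(t)


# Hypothetical two-dimensional inputs and length scales.
x = [0.3, -1.2]
y = [1.0, 0.5]
s = [0.7, 2.0]
same_scales_gap = abs(multiscale(x, y, s, s) - se_ard(x, y, s, rho=1.0))
```

With shared length scales the prefactor in (17) equals one and the exponent matches the SE-ARD exponent, so same_scales_gap is zero up to floating-point rounding; with different length scales the prefactor is strictly below one.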
We evaluate the models in terms of the normalized mean squared error (nMSE) on a held-out test set after 500 iterations.\nWe performed experiments on three real-world robotic datasets, kin40k$^4$, SARCOS$^5$, and KUKA$^6$, and three variations of iVSGPR: iVSGPR5, iVSGPR10, and iVSGPRada.$^7$ For the kin40k and SARCOS datasets, we also implemented VSGPR*svi, which uses stochastic variational inference to update $\\tilde{m}$ and $\\tilde{S}$ but fixes hyperparameters and inducing points as the solution to the batch variational sparse GPR [26] with all of the training data. Because VSGPR*svi reflects the perfect scenario of performing stochastic approximation under the selected learning rate, we consider it as the optimal goal we want to approach.\nThe experimental results of kin40k and SARCOS are summarized in Table 1a. In general, the adaptive scheme iVSGPRada performs the best, but we observe that even performing a small fixed number of iterations (iVSGPR5, iVSGPR10) results in performance that is close to, if not better than, VSGPR*svi. Possible explanations are that the change of objective function in gradient-based algorithms is dominant in the first few iterations and that the found inducing points and hyper-parameters have finite numerical resolution in batch optimization. For example, Figure 1a shows the change of test error over iterations in learning joint 2 of the SARCOS dataset. For all methods, the convergence rate improves with a larger minibatch. In addition, from Figure 1b, we observe that the number of steps iVSGPRada needs to solve (16) decays with the number of iterations; only a small number of line searches is required after the first few iterations.\nTable 1b and Table 1c show the experimental results on two larger datasets. 
In the experiments, we mixed the offline and online partitions in the original KUKA dataset and then split 90% into training and 10% into testing datasets in order to create an online i.i.d. streaming scenario. We did not compare to VSGPR$^*_{svi}$ on these datasets, since computing the inducing points and hyperparameters in batch is infeasible. As above, iVSGPR$_{ada}$ stands out from other models, closely followed by iVSGPR$_{10}$. We found that the difference between VSGPR$_{svi}$ and iVSGPRs is much greater on these larger real-world benchmarks.\nAuxiliary experimental results illustrating convergence for all experiments summarized in Tables 1a, 1b, and 1c are included in the Appendix.\n\n4 kin40k: 10000 training data, 30000 testing data, 8 attributes [23].\n5 SARCOS: 44484 training data, 4449 testing data, 28 attributes. http://www.gaussianprocess.org/gpml/data/\n6 KUKA1 & KUKA2: 17560 offline data, 180360 online data, 28 attributes [15].\n7 The number in the subscript denotes the number of function calls allowed in nonlinear conjugate gradient descent [20] to solve subproblems (16), and ada denotes that (16) is solved until the relative function change is less than $10^{-5}$.\n\n(a) kin40k and SARCOS\n\n| Dataset | VSGPR$_{svi}$ | iVSGPR$_5$ | iVSGPR$_{10}$ | iVSGPR$_{ada}$ | VSGPR$^*_{svi}$ |\n| kin40k | 0.0959 | 0.0648 | 0.0608 | 0.0607 | 0.0535 |\n| SARCOS J1 | 0.0247 | 0.0228 | 0.0214 | 0.0210 | 0.0208 |\n| SARCOS J2 | 0.0193 | 0.0176 | 0.0159 | 0.0156 | 0.0156 |\n| SARCOS J3 | 0.0125 | 0.0112 | 0.0104 | 0.0103 | 0.0104 |\n| SARCOS J4 | 0.0048 | 0.0044 | 0.0040 | 0.0038 | 0.0039 |\n| SARCOS J5 | 0.0267 | 0.0243 | 0.0229 | 0.0226 | 0.0230 |\n| SARCOS J6 | 0.0300 | 0.0259 | 0.0235 | 0.0229 | 0.0230 |\n| SARCOS J7 | 0.0101 | 0.0090 | 0.0082 | 0.0081 | 0.0101 |\n\n(b) KUKA1\n\n| Joint | VSGPR$_{svi}$ | iVSGPR$_5$ | iVSGPR$_{10}$ | iVSGPR$_{ada}$ |\n| J1 | 0.1699 | 0.1455 | 0.1257 | 0.1176 |\n| J2 | 0.1530 | 0.1305 | 0.1221 | 0.1138 |\n| J3 | 0.1873 | 0.1554 | 0.1403 | 0.1252 |\n| J4 | 0.1376 | 0.1216 | 0.1151 | 0.1108 |\n| J5 | 0.1955 | 0.1668 | 0.1487 | 0.1398 |\n| J6 | 0.1766 | 0.1645 | 0.1573 | 0.1506 |\n| J7 | 0.1374 | 0.1357 | 0.1342 | 0.1333 |\n\n(c) KUKA2\n\n| Joint | VSGPR$_{svi}$ | iVSGPR$_5$ | iVSGPR$_{10}$ | iVSGPR$_{ada}$ |\n| J1 | 0.1737 | 0.1452 | 0.1284 | 0.1214 |\n| J2 | 0.1517 | 0.1312 | 0.1183 | 0.1081 |\n| J3 | 0.2108 | 0.1818 | 0.1652 | 0.1544 |\n| J4 | 0.1357 | 0.1171 | 0.1104 | 0.1046 |\n| J5 | 0.2082 | 0.1846 | 0.1697 | 0.1598 |\n| J6 | 0.1925 | 0.1890 | 0.1855 | 0.1809 |\n| J7 | 0.1329 | 0.1309 | 0.1287 | 0.1275 |\n\nTable 1: Testing error (nMSE) after 500 iterations. $N_m = 2048$; J$i$ denotes the $i$th joint.\n\nFigure 1: Online learning results for SARCOS joint 2. (a) Test error: nMSE evaluated on the held-out test set; the dashed lines and the solid lines denote the results with $N_m = 512$ and $N_m = 2048$, respectively. (b) Function calls of iVSGPR$_{ada}$: number of function calls used by iVSGPR$_{ada}$ in solving (16) (a maximum of 100 calls is imposed).\n\n5 Conclusion\nWe propose a stochastic approximation of variational sparse GPR [26], iVSGPR. By reformulating the variational inference in RKHS, the update of the statistics of the inducing functions and the inducing points can be unified as stochastic mirror ascent on probability densities to consider the manifold structure. In our experiments, iVSGPR shows better performance than the direct adoption of stochastic variational inference to solve variational sparse GPs. As iVSGPR executes a fixed number of operations for each minibatch, it is suitable for applications where training data is abundant, e.g. sensory data in robotics. In future work, we are interested in applying iVSGPR to extensions of sparse Gaussian processes such as GP-LVMs and dynamical system modeling.\n\nReferences\n[1] Ahmed H Abdel-Gawad, Thomas P Minka, et al. 
Sparse-posterior Gaussian processes for general likelihoods. arXiv preprint arXiv:1203.3507, 2012.\n\n[2] Mauricio Alvarez and Neil D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems, pages 57\u201364, 2009.\n\n[3] Mauricio A Alvarez, David Luengo, Michalis K Titsias, and Neil D Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In International Conference on Artificial Intelligence and Statistics, pages 25\u201332, 2010.\n\n[4] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251\u2013276, 1998.\n\n[5] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705\u20131749, 2005.\n\n[6] Lehel Csat\u00f3 and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641\u2013668, 2002.\n\n[7] Bo Dai, Niao He, Hanjun Dai, and Le Song. Scalable Bayesian inference via particle mirror descent. arXiv preprint arXiv:1506.03101, 2015.\n\n[8] Anibal Figueiras-Vidal and Miguel L\u00e1zaro-Gredilla. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087\u20131095, 2009.\n\n[9] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.\n\n[10] Trong Nghia Hoang, Quang Minh Hoang, and Kian Hsiang Low. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pages 569\u2013578, 2015.\n\n[11] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303\u20131347, 2013.\n\n[12] Irina Holmes and Ambar N Sengupta. 
The Gaussian Radon transform and machine learning. Infinite Dimensional Analysis, Quantum Probability and Related Topics, 18(03):1550019, 2015.\n\n[13] Mohammad E Khan, Pierre Baqu\u00e9, Fran\u00e7ois Fleuret, and Pascal Fua. Kullback-Leibler proximal variational inference. In Advances in Neural Information Processing Systems, pages 3384\u20133392, 2015.\n\n[14] Alexander G de G Matthews, James Hensman, Richard E Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.\n\n[15] Franziska Meier, Philipp Hennig, and Stefan Schaal. Incremental local Gaussian regression. In Advances in Neural Information Processing Systems, pages 972\u2013980, 2014.\n\n[16] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. The Journal of Machine Learning Research, 7:2651\u20132667, 2006.\n\n[17] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574\u20131609, 2009.\n\n[18] Joaquin Qui\u00f1onero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939\u20131959, 2005.\n\n[19] Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451\u20131457, 2015.\n\n[20] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. 2006.\n\n[21] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.\n\n[22] Rishit Sheth, Yuyang Wang, and Roni Khardon. 
Sparse variational inference for generalized GP models. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1302\u20131311, 2015.\n\n[23] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257\u20131264, 2005.\n\n[24] Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In International Conference on Artificial Intelligence and Statistics, pages 524\u2013531, 2007.\n\n[25] Lucas Theis and Matthew D Hoffman. A trust-region method for stochastic variational inference with applications to streaming data. arXiv preprint arXiv:1505.07649, 2015.\n\n[26] Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567\u2013574, 2009.\n\n[27] Christian Walder, Kwang In Kim, and Bernhard Sch\u00f6lkopf. Sparse multiscale Gaussian process regression. In Proceedings of the 25th International Conference on Machine Learning, pages 1112\u20131119. ACM, 2008.", "award": [], "sourceid": 2167, "authors": [{"given_name": "Ching-An", "family_name": "Cheng", "institution": "Georgia Institute of Technolog"}, {"given_name": "Byron", "family_name": "Boots", "institution": "Georgia Tech"}]}