{"title": "Implicit Posterior Variational Inference for Deep Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 14502, "page_last": 14513, "abstract": "A multi-layer deep Gaussian process (DGP) model is a hierarchical composition of GP models with a greater expressive power. Exact DGP inference is intractable, which has motivated the recent development of deterministic and stochastic approximation methods. Unfortunately, the deterministic approximation methods yield a biased posterior belief while the stochastic one is computationally costly. This paper presents an implicit posterior variational inference (IPVI) framework for DGPs that can ideally recover an unbiased posterior belief and still preserve time efficiency. Inspired by generative adversarial networks, our IPVI framework achieves this by casting the DGP inference problem as a two-player game in which a Nash equilibrium, interestingly, coincides with an unbiased posterior belief. This consequently inspires us to devise a best-response dynamics algorithm to search for a Nash equilibrium (i.e., an unbiased posterior belief). Empirical evaluation shows that IPVI outperforms the state-of-the-art approximation methods for DGPs.", "full_text": "Implicit Posterior Variational Inference for\n\nDeep Gaussian Processes\n\nHaibin Yu\u2217, Yizhou Chen\u2217, Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet\u2020\n\nDept. of Computer Science, National University of Singapore, Republic of Singapore\n\nDept. of Electrical Engineering and Computer Science, MIT, USA\u2020\n\n{haibin,ychen041,daiz,lowkh}@comp.nus.edu.sg, jaillet@mit.edu\u2020\n\nAbstract\n\nA multi-layer deep Gaussian process (DGP) model is a hierarchical composition\nof GP models with a greater expressive power. Exact DGP inference is intractable,\nwhich has motivated the recent development of deterministic and stochastic ap-\nproximation methods. 
Unfortunately, the deterministic approximation methods\nyield a biased posterior belief while the stochastic one is computationally costly.\nThis paper presents an implicit posterior variational inference (IPVI) framework\nfor DGPs that can ideally recover an unbiased posterior belief and still preserve\ntime ef\ufb01ciency. Inspired by generative adversarial networks, our IPVI framework\nachieves this by casting the DGP inference problem as a two-player game in which\na Nash equilibrium, interestingly, coincides with an unbiased posterior belief. This\nconsequently inspires us to devise a best-response dynamics algorithm to search for\na Nash equilibrium (i.e., an unbiased posterior belief). Empirical evaluation shows\nthat IPVI outperforms the state-of-the-art approximation methods for DGPs.\n\n1\n\nIntroduction\n\nThe expressive power of the Bayesian non-parametric Gaussian process (GP) [46] models can be\nsigni\ufb01cantly boosted by composing them hierarchically into a multi-layer deep GP (DGP) model,\nas shown in the seminal work of [12]. Though the DGP model can likewise exploit the notion\nof inducing variables [5, 24, 25, 36, 40, 45, 55, 57] to improve its scalability, doing so does not\nimmediately entail tractable inference, unlike the GP model. This has motivated the development\nof deterministic and stochastic approximation methods, the former of which have imposed varying\nstructural assumptions across the DGP hidden layers and assumed a Gaussian posterior belief of\nthe inducing variables [3, 10, 12, 20, 48]. However, the work of [18] has demonstrated that with at\nleast one DGP hidden layer, the posterior belief of the inducing variables is usually non-Gaussian,\nhence potentially compromising the performance of the deterministic approximation methods due to\ntheir biased posterior belief. 
To resolve this, the stochastic approximation method of [18] utilizes\nstochastic gradient Hamiltonian Monte Carlo (SGHMC) sampling to draw unbiased samples from\nthe posterior belief. But, generating such samples is computationally costly in both training and\nprediction due to its sequential sampling procedure [54] and its convergence is also dif\ufb01cult to assess.\nSo, the challenge remains in devising a time-ef\ufb01cient approximation method that can recover an\nunbiased posterior belief.\nThis paper presents an implicit posterior variational inference (IPVI) framework for DGPs (Section 3)\nthat can ideally recover an unbiased posterior belief and still preserve time ef\ufb01ciency, hence combining\nthe best of both worlds (respectively, stochastic and deterministic approximation methods). Inspired\nby generative adversarial networks [17] that can generate samples to represent complex distributions\n\n\u2217Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fwhich are hard to model using an explicit likelihood [31, 53], our IPVI framework achieves this by\ncasting the DGP inference problem as a two-player game in which a Nash equilibrium, interestingly,\ncoincides with an unbiased posterior belief. This consequently inspires us to devise a best-response\ndynamics algorithm to search for a Nash equilibrium [2] (i.e., an unbiased posterior belief). In\nSection 4, we discuss how the architecture of the generator in our IPVI framework is designed\nto enable parameter tying for a DGP model to alleviate over\ufb01tting. We empirically evaluate the\nperformance of IPVI on several real-world datasets in supervised (e.g., regression and classi\ufb01cation)\nand unsupervised learning tasks (Section 5).\n\n2 Background and Related Work\nGaussian Process (GP). Let a random function f : RD \u2192 R be distributed by a GP with a zero\nprior mean and covariance function k : RD \u00d7 RD \u2192 R. 
That is, suppose that a set y := {y_n}_{n=1}^N of N noisy observed outputs y_n := f(x_n) + ε(x_n) (i.e., corrupted by an i.i.d. Gaussian noise ε(x_n) with noise variance ν^2) are available for some set X := {x_n}_{n=1}^N of N training inputs. Then, the set f := {f(x_n)}_{n=1}^N of latent outputs follows a Gaussian prior belief p(f) := N(f | 0, K_XX) where K_XX denotes a covariance matrix with components k(x_n, x_n') for n, n' = 1, ..., N, and it follows that p(y|f) = N(y | f, ν^2 I). The GP predictive/posterior belief of the latent outputs f_* := {f(x_*)}_{x_* ∈ X_*} for any set X_* of test inputs can be computed in closed form [46] by marginalizing out f: p(f_*|y) = ∫ p(f_*|f) p(f|y) df. But, doing so incurs cubic time in N, hence scaling poorly to massive datasets.

To improve its scalability to linear time in N, the sparse GP (SGP) models spanned by the unifying view of [45] exploit a set u := {u_m := f(z_m)}_{m=1}^M of inducing output variables for some small set Z := {z_m}_{m=1}^M of inducing inputs (i.e., M ≪ N). Then,

p(y, f, u) = p(y|f) p(f|u) p(u)    (1)

such that p(f|u) = N(f | K_XZ K_ZZ^{-1} u, K_XX − K_XZ K_ZZ^{-1} K_ZX) where, with a slight abuse of notation, u is treated as a column vector here, K_XZ := K_ZX^T, and K_ZZ and K_ZX denote covariance matrices with components k(z_m, z_m') for m, m' = 1, ..., M and k(z_m, x_n) for m = 1, ..., M and n = 1, ..., N, respectively. 
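As a concrete illustration of the exact GP posterior above, the closed-form mean and covariance can be sketched in a few lines of numpy (the squared-exponential kernel, unit signal variance, and toy sine data are our own illustrative choices, not the paper's):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def gp_posterior(X, y, Xs, noise_var=1e-2):
    # Exact GP posterior p(f_* | y): the cubic cost in N comes from solving
    # against the N x N matrix K_XX + noise_var * I.
    K = rbf(X, X) + noise_var * np.eye(len(X))
    Ksx = rbf(Xs, X)
    mean = Ksx @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs) - Ksx @ np.linalg.solve(K, Ksx.T)
    return mean, cov

X = np.linspace(-3.0, 3.0, 20)[:, None]   # toy training inputs
y = np.sin(X[:, 0])                        # toy observations
mean, cov = gp_posterior(X, y, X)          # posterior at the training inputs
far_mean, far_cov = gp_posterior(X, y, np.array([[10.0]]))  # far from the data
```

At the training inputs the posterior mean reverts to the data and the posterior variance shrinks toward the noise level, while far from the data the posterior reverts to the prior variance.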
The SGP predictive belief can also be computed in closed form by marginalizing out u: p(f_*|y) = ∫ p(f_*|u) p(u|y) du.

The work of [50] has proposed a principled variational inference (VI) framework that approximates the joint posterior belief p(f, u|y) with a variational posterior q(f, u) := p(f|u) q(u) by minimizing the Kullback-Leibler (KL) distance between them, which is equivalent to maximizing a lower bound of the log-marginal likelihood (i.e., also known as the evidence lower bound (ELBO)):

ELBO := E_{q(f)}[log p(y|f)] − KL[q(u) ‖ p(u)]

where q(f) := ∫ p(f|u) q(u) du. A common choice in VI is the Gaussian variational posterior q(u) := N(u | m, S) of the inducing variables u [14, 16, 19, 24, 25, 51] which results in a Gaussian marginal q(f) = N(f | μ, Σ) where μ := K_XZ K_ZZ^{-1} m and Σ := K_XX − K_XZ K_ZZ^{-1} (K_ZZ − S) K_ZZ^{-1} K_ZX.

Deep Gaussian Process (DGP). A multi-layer DGP model is a hierarchical composition of GP models. Consider a DGP with a depth of L such that each DGP layer is associated with a set F_{ℓ−1} of inputs and a set F_ℓ of outputs for ℓ = 1, ..., L and F_0 := X. Let F := {F_ℓ}_{ℓ=1}^L, and the inducing inputs and corresponding inducing output variables for DGP layers ℓ = 1, ..., L be denoted by the respective sets Z := {Z_ℓ}_{ℓ=1}^L and U := {U_ℓ}_{ℓ=1}^L. 
Similar to the joint probability distribution of the SGP model (1),

p(y, F, U) = p(y|F_L) [∏_{ℓ=1}^L p(F_ℓ|U_ℓ)] p(U)

where p(y|F_L) is the data likelihood and [∏_{ℓ=1}^L p(F_ℓ|U_ℓ)] p(U) is the DGP prior. Similarly, the variational posterior is assumed to be q(F, U) := [∏_{ℓ=1}^L p(F_ℓ|U_ℓ)] q(U), thus resulting in the following ELBO for the DGP model:

ELBO := ∫ q(F_L) log p(y|F_L) dF_L − KL[q(U) ‖ p(U)]    (2)

2

where q(F_L) := ∫ ∏_{ℓ=1}^L p(F_ℓ|U_ℓ, F_{ℓ−1}) q(U) dF_1 ... dF_{L−1} dU. To compute q(F_L), the work of [48] has proposed the use of the reparameterization trick [32] and Monte Carlo sampling, which are adopted in this work.
Remark 1. To the best of our knowledge, the DGP models exploiting the inducing variables2 and the VI framework [10, 12, 20, 48] have imposed the highly restrictive assumptions of (i) a mean-field approximation q(U) := ∏_{ℓ=1}^L q(U_ℓ) and (ii) a biased Gaussian variational posterior q(U_ℓ). In fact, the true posterior belief usually exhibits a high correlation across the DGP layers and is non-Gaussian [18], hence potentially compromising the performance of such deterministic approximation methods for DGP models. 
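Before moving on, the single-layer SVGP marginal q(f) = N(μ, Σ) from Section 2 admits a quick numerical sanity check (a numpy sketch; the RBF kernel and toy sizes are our own choices): substituting m = 0 and S = K_ZZ, i.e., q(u) = p(u), should recover the prior, μ = 0 and Σ = K_XX.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def svgp_marginal(X, Z, m, S, jitter=1e-8):
    # q(f) = N(mu, Sigma) for q(u) = N(m, S), per the formulas in Section 2:
    #   mu = K_XZ K_ZZ^{-1} m,  Sigma = K_XX - K_XZ K_ZZ^{-1} (K_ZZ - S) K_ZZ^{-1} K_ZX.
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    A = np.linalg.solve(Kzz, rbf(X, Z).T).T    # A = K_XZ K_ZZ^{-1}
    return A @ m, rbf(X, X) - A @ (Kzz - S) @ A.T

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))    # toy inputs
Z = rng.standard_normal((3, 2))    # toy inducing inputs
Kzz = rbf(Z, Z) + 1e-8 * np.eye(3)
mu, Sigma = svgp_marginal(X, Z, m=np.zeros(3), S=Kzz)   # q(u) = p(u)
```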
To remove these assumptions, we will propose a principled approximation method that can generate unbiased posterior samples even under the VI framework, as detailed in Section 3.

3 Implicit Posterior Variational Inference (IPVI) for DGPs

Unlike the conventional VI framework for existing DGP models [10, 12, 20, 48], our proposed IPVI framework does not need to impose their highly restrictive assumptions (Remark 1) and can still preserve the time efficiency of VI. Inspired by previous works on adversarial-based inference [30, 42], IPVI achieves this by first generating posterior samples U := g_Φ(ε) with a black-box generator g_Φ(ε) parameterized by Φ and a random noise ε ~ N(0, I). By representing the variational posterior as q_Φ(U) := ∫ p(U|ε) p(ε) dε, the ELBO in (2) can be re-written as

ELBO = E_{q(F_L)}[log p(y|F_L)] − KL[q_Φ(U) ‖ p(U)] .    (3)

An immediate advantage of the generator g_Φ(ε) is that it can generate the posterior samples in parallel by feeding it a batch of randomly sampled ε's. However, representing the variational posterior q_Φ(U) implicitly makes it impossible to evaluate the KL distance in (3) since q_Φ(U) cannot be calculated explicitly. By observing that the KL distance is equal to the expectation of the log-density ratio E_{q_Φ(U)}[log q_Φ(U) − log p(U)], we can circumvent an explicit calculation of the KL distance term by implicitly representing the log-density ratio as a separate function T to be optimized, as shown in our first result below:
Proposition 1. Let σ(x) := 1/(1 + exp(−x)). Consider the following maximization problem:

max_T  E_{p(U)}[log(1 − σ(T(U)))] + E_{q_Φ(U)}[log σ(T(U))] .    (4)

If p(U) and q_Φ(U) are known, then the optimal T* with respect to (4) is the log-density ratio:

T*(U) = log q_Φ(U) − log p(U) .    (5)

Its proof (Appendix A) is similar to that of Proposition 1 in [17] except that we use a sigmoid function σ to reveal the log-density ratio. Note that (4) defines a binary cross-entropy between samples from the variational posterior q_Φ(U) and prior p(U). Intuitively, T in (4), which we refer to as a discriminator, tries to distinguish between q_Φ(U) and p(U) by outputting σ(T(U)) as the probability of U being a sample from q_Φ(U) rather than p(U).
Using Proposition 1 (i.e., (5)), the ELBO in (3) can be re-written as

ELBO = E_{q_Φ(U)}[L(θ, X, y, U) − T*(U)]    (6)

where L(θ, X, y, U) := E_{p(F_L|U)}[log p(y|F_L)] and θ denotes the DGP model hyperparameters. The ELBO can now be calculated given the optimal discriminator T*. In our implementation, we adopt a parametric representation for the discriminator T. In principle, the parametric representation is required to be expressive enough to be able to represent the optimal discriminator T* accurately. Motivated by the fact that deep neural networks are universal function approximators [29], we represent the discriminator T_Ψ by a neural network with parameters Ψ; the optimal T_Ψ* is thus parameterized by Ψ*. The architecture of the generator and discriminator in our IPVI framework will be discussed in Section 4. The ELBO in (6) can be optimized with respect to Φ and θ via gradient ascent, provided that the optimal T_Ψ* (with respect to q_Φ) can be obtained in every iteration. 
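Proposition 1 can be checked on a one-dimensional toy example (a numpy sketch under our own assumptions: p = N(0, 1), q = N(1, 1), and a linear discriminator T(u) = a·u + b, which suffices here because the true log-density ratio log q(u) − log p(u) = u − 0.5 is linear). Maximizing objective (4) by gradient ascent should then drive (a, b) toward (1, −0.5):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 20000
V = rng.normal(0.0, 1.0, n)   # samples from the "prior"     p = N(0, 1)
U = rng.normal(1.0, 1.0, n)   # samples from the "posterior" q = N(1, 1)

# Linear discriminator T(u) = a*u + b (an assumption sufficient for this toy case).
a, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    sV, sU = sigmoid(a * V + b), sigmoid(a * U + b)
    # Gradient ascent on objective (4): E_p[log(1 - sigma(T))] + E_q[log sigma(T)],
    # using d/dT log(1 - sigma(T)) = -sigma(T) and d/dT log sigma(T) = 1 - sigma(T).
    g_a = (-sV * V).mean() + ((1.0 - sU) * U).mean()
    g_b = (-sV).mean() + (1.0 - sU).mean()
    a, b = a + lr * g_a, b + lr * g_b
```

Up to sampling error, the learned T(u) ≈ u − 0.5 = log q(u) − log p(u), which is exactly the quantity needed to evaluate the KL term in (3) without an explicit density for q.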
One way to achieve this is to cast the optimization of the ELBO as a two-player pure-strategy game between Player 1 (representing the discriminator with strategy {Ψ}) vs. Player 2 (jointly representing the generator and DGP model with strategy {Φ, θ}) that is defined based on the following payoffs:

Player 1 : max_{Ψ}  E_{p(U)}[log(1 − σ(T_Ψ(U)))] + E_{q_Φ(U)}[log σ(T_Ψ(U))] ,
Player 2 : max_{θ,Φ}  E_{q_Φ(U)}[L(θ, X, y, U) − T_Ψ(U)] .    (7)

2An alternative is to modify the DGP prior directly and perform inference with a parametric model. The work of [9] has approximated the DGP prior with the spectral density of a kernel [22] such that the kernel has an analytical spectral density.

3

Algorithm 1: Main
1 Randomly initialize θ, Ψ, Φ
2 while not converged do
3   Run Algorithm 2
4   Run Algorithm 3

Algorithm 2: Player 1
1 Sample {V^1, ..., V^K} from p(U)
2 Sample {U^1, ..., U^K} from q_Φ(U)
3 Compute gradient w.r.t. Ψ from (7):
4   g_Ψ := ∇_Ψ [(1/K) ∑_{k=1}^K log(1 − σ(T_Ψ(V^k)))] + ∇_Ψ [(1/K) ∑_{k=1}^K log σ(T_Ψ(U^k))]
5 SGA update for Ψ:
6   Ψ ← Ψ + α_Ψ g_Ψ

Algorithm 3: Player 2
1 Sample mini-batch (X_b, y_b) from (X, y)
2 Sample {U^1, ..., U^K} from q_Φ(U)
3 Compute gradients w.r.t. θ and Φ from (7):
4   g_θ := ∇_θ [(1/K) ∑_{k=1}^K L(θ, X_b, y_b, U^k)] ,  g_Φ := ∇_Φ [(1/K) ∑_{k=1}^K L(θ, X_b, y_b, U^k) − T_Ψ(U^k)]
5 SGA updates for θ and Φ:
6   θ ← θ + α_θ g_θ ,  Φ ← Φ + α_Φ g_Φ

Alternating strategy updates: {Ψ_0}, {θ_0, Φ_0} -> Player 1 -> {Ψ_1}, {θ_0, Φ_0} -> Player 2 -> {Ψ_1}, {θ_1, Φ_1} -> Player 1 -> {Ψ_2}, {θ_1, Φ_1} -> Player 2 -> ...

Figure 1: Best-response dynamics (BRD) algorithm based on our IPVI framework for DGPs.

Proposition 2. Suppose that the parametric representations of T_Ψ and g_Φ are expressive enough to represent any function. If ({Ψ*}, {θ*, Φ*}) is a Nash equilibrium of the game in (7), then {θ*, Φ*} is a global maximizer of the ELBO in (3) such that (a) θ* is the maximum likelihood assignment for the DGP model, and (b) q_Φ*(U) is equal to the true posterior belief p(U|y).

Its proof is similar to that of Proposition 3 in [42] except that we additionally provide a proof of existence of a Nash equilibrium for the case of known/fixed DGP model hyperparameters, as detailed in Appendix B. Proposition 2 reveals that any Nash equilibrium coincides with a global maximizer of the original ELBO in (3). This consequently inspires us to play the game using best-response dynamics3 (BRD) which is a commonly adopted procedure [2] to search for a Nash equilibrium. Fig. 1 illustrates our BRD algorithm: In each iteration of Algorithm 1, each player takes its turn to improve its strategy to achieve a better (but not necessarily the best) payoff by performing a stochastic gradient ascent (SGA) update on its payoff (7).
Remark 2. While BRD is guaranteed to converge to a Nash equilibrium in some classes of games (e.g., a finite potential game), we have not shown that our game falls into any of these classes and hence cannot guarantee that BRD converges to a Nash equilibrium (i.e., global maximizer {θ*, Φ*}) of our game. 
Nevertheless, as mentioned previously, obtaining the optimal discriminator in every iteration guarantees that the game play (i.e., the gradient ascent update for {θ, Φ}) reaches at least a local maximum of the ELBO. To better approximate the optimal discriminator, we perform multiple calls of Algorithm 2 in every iteration of the main loop in Algorithm 1 and also apply a larger learning rate α_Ψ. We have observed in our own experiments that these tricks improve the predictive performance of IPVI.
Remark 3. Existing implicit VI frameworks [52, 56] avoid the estimation of the log-density ratio. Unfortunately, the semi-implicit VI framework of [56] requires taking a limit at infinity to recover the ELBO, while the unbiased implicit VI framework of [52] relies on a Markov chain Monte Carlo sampler whose hyperparameters need to be carefully tuned.

3This procedure is sometimes called "better-response dynamics" (http://timroughgarden.org/f13/l/l16.pdf).

4

Figure 2: (a) Illustration of a naive design of the generator for each layer ℓ. Parameter-tying architecture of the (b) generator and (c) discriminator for each layer ℓ where '+' denotes addition and '⊕' denotes concatenation. (d) Parameter-tying architecture of the generator and discriminator in our IPVI framework for DGPs. See the main text for the definitions of notations.

4 Parameter-Tying Architecture of Generator and Discriminator for DGPs

In this section, we will discuss how the architecture of the generator in our IPVI framework is designed to enable parameter tying for a DGP model to alleviate overfitting. Recall from Section 2 that U = {U_ℓ}_{ℓ=1}^L is a collection of inducing variables for DGP layers ℓ = 1, ..., L. 
We consider a layer-wise design of the generator (parameterized by Φ := {φ_ℓ}_{ℓ=1}^L) and discriminator (parameterized by Ψ := {ψ_ℓ}_{ℓ=1}^L) such that g_Φ(ε) := {g_{φ_ℓ}(ε)}_{ℓ=1}^L with the random noise ε serving as a common input to induce dependency between layers and T_Ψ(U) := ∑_{ℓ=1}^L T_{ψ_ℓ}(U_ℓ), respectively. For each layer ℓ, a naive design is to generate posterior samples U_ℓ := g_{φ_ℓ}(ε) from the random noise ε as input. However, such a design suffers from two critical issues:
• Fig. 2a illustrates that to generate posterior samples of M different inducing variables U_{ℓ1}, ..., U_{ℓM} (U_ℓ := {U_{ℓm}}_{m=1}^M), it is natural for the generator to adopt M different parametric settings φ_{ℓ1}, ..., φ_{ℓM} (φ_ℓ := {φ_{ℓm}}_{m=1}^M), which introduces a relatively large number of parameters and is thus prone to overfitting (Section 5.3).
• Such a design of the generator fails to adequately capture the dependency of the inducing output variables U_ℓ on the corresponding inducing inputs Z_ℓ, hence restricting its capability to output the posterior samples of U accurately.

To resolve the above issues, we propose a novel parameter-tying architecture of the generator and discriminator for a DGP model, as shown in Figs. 2b and 2c. For each layer ℓ, since U_ℓ depends on Z_ℓ, we design the generator g_{φ_ℓ} to generate posterior samples U_ℓ := g_{φ_ℓ}(ε ⊕ Z_ℓ) from not just ε but also Z_ℓ as inputs. Recall that the same ε is fed as an input to g_{φ_ℓ} in each layer ℓ, which can be observed from the left-hand side of Fig. 2d. In addition, compared with the naive design in Fig. 
2a,\nthe posterior samples of M different inducing variables U(cid:96)1, . . . , U(cid:96)M are generated based on only\na single shared parameter setting (instead of M), which reduces the number of parameters by O(M )\ntimes (Fig. 2b). We adopt a similar design for the discriminator, as shown in Fig. 2c. Fig. 2d illustrates\nthe design of the overall parameter-tying architecture of the generator and discriminator.\nWe have observed in our own experiments that our proposed parameter-tying architecture not only\nspeeds up the training and prediction, but also improves the predictive performance of IPVI consid-\nerably (Section 5.3). We will empirically evaluate our IPVI framework with this parameter-tying\narchitecture in Section 5.\n\n5 Experiments and Discussion\n\nWe empirically evaluate and compare the performance of our IPVI framework4 against that of the\nstate-of-the-art SGHMC [18] and doubly stochastic VI5 (DSVI) [48] for DGPs based on their publicly\n\n4Our implementation is built on GP\ufb02ow [41] which is an open-source GP framework based on TensorFlow\n\n[1]. It is publicly available at https://github.com/HeroKillerEver/ipvi-dgp.\n\n5It is reported in [48] that DSVI has outperformed the approximate expectation propagation method of [3]\n\nfor DGPs. 
Hence, we do not empirically compare with the latter [3] here.

5

Figure 3: (a) The probability density function (PDF) plot of the ground-truth posterior belief p(f|y). (b) Performances of IPVI and SGHMC in terms of the estimated Jensen-Shannon divergence (JSD) and mean log-likelihood (MLL) metrics under the respective settings of varying learning rates α_Ψ and step sizes η. (c) Graph of MLL vs. JSD achieved by IPVI with varying number of parameters in the generator: Different shapes indicate varying number of modes learned by the generator. (d-e) PDF plots of the variational posterior q(f; x = 0) learned using (d) IPVI with generators of varying learning rates α_Ψ and (e) SGHMC with varying step sizes η.

available implementations using synthetic and real-world datasets in supervised (e.g., regression and classification) and unsupervised learning tasks.

5.1 Synthetic Experiment: Learning a Multi-Modal Posterior Belief

To demonstrate the capability of IPVI in learning a complex multi-modal posterior belief, we generate a synthetic "diamond" dataset and adopt a multi-modal mixture of Gaussian prior belief p(f) (see Appendix C.1 for its description) to yield a multi-modal posterior belief p(f|y) for a single-layer GP. Fig. 3a illustrates this dataset and ground-truth posterior belief. 
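The mechanism behind this synthetic setup can be illustrated in one dimension (a numpy sketch with our own toy numbers, not the paper's "diamond" dataset): placing a two-component Gaussian-mixture prior on a scalar f and combining it with a Gaussian likelihood via Bayes rule on a grid yields a bimodal posterior that no single Gaussian approximation can capture.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

f = np.linspace(-6.0, 6.0, 2001)       # grid over the latent output f at one input
df = f[1] - f[0]

# Two-component Gaussian-mixture prior p(f) and a Gaussian likelihood for y = 0.
prior = 0.5 * normal_pdf(f, -2.0, 0.5) + 0.5 * normal_pdf(f, 2.0, 0.5)
likelihood = normal_pdf(0.0, f, 1.0)   # N(y = 0 | f, 1)

# Bayes rule on the grid: posterior ~ prior * likelihood, normalized by its Riemann sum.
post = prior * likelihood
post /= post.sum() * df

# Interior local maxima of the posterior (its modes); analytically they sit at +/-1.6.
modes = f[1:-1][(post[1:-1] > post[:-2]) & (post[1:-1] > post[2:])]
```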
Speci\ufb01cally, we focus on the\nmulti-modal posterior belief p(f|y; x = 0) at input x = 0 whose ground truth is shown in Fig. 3d.\nFig. 3c shows that as the number of parameters in the generator increases, the expressive power of\nIPVI increases such that its variational posterior q(f ; x = 0) can capture more modes in the true\nposterior, thus resulting in a closer estimated Jensen-Shannon divergence (JSD) between them and a\nhigher mean log-likelihood (MLL).\nNext, we compare the robustness of IPVI and SGHMC in learning the true multi-modal posterior\nbelief p(f|y; x = 0) under different hyperparameter settings6: The generators in IPVI use the same\narchitecture with about 300 parameters but different learning rates \u03b1\u03a8, while the SGHMC samplers\nuse different step sizes \u03b7. The results in Figs. 3b and 3e have veri\ufb01ed a remark made in [58] that\nSGHMC is sensitive to the step size which cannot be set automatically [49] and requires some prior\nknowledge to do so: Sampling with a small step size is prone to getting trapped in local modes while\na slight increase of the step size may lead to an over-\ufb02attened posterior estimate. Additional results\nfor different hyperparameter settings of SGHMC can be found in Appendix C.1. In contrast, the\nresults in Figs. 3b and 3d reveal that, given enough parameters, IPVI performs robustly under a wide\nrange of learning rates.\n\n6We adopt scale-adapted SGHMC which is a robust variant used in Bayesian neural networks and DGP\ninference [18]. A recent work of [58] has proposed the cyclical stochastic gradient MCMC method to improve\nthe accuracy of sampling highly complex distributions. 
However, it is not obvious to us how this method can be incorporated into DGP models, which is beyond the scope of this work.

6

[Figure 3 plots; the table in Fig. 3b reads:
Setting              JSD      MLL
IPVI A (LR = 1e-4)   1.0e-2   -1.15
IPVI B (LR = 1e-3)   8.3e-3   -0.99
IPVI C (LR = 1e-2)   8.6e-3   -1.02
SGHMC A (η = 0.1)    2.1e-2   -2.36
SGHMC B (η = 0.3)    1.2e-2   -1.10
SGHMC C (η = 0.5)    7.5e-2   -2.83]

Figure 4: Test MLL and standard deviation achieved by our IPVI framework (red), SGHMC (blue), and DSVI (black) for DGPs for UCI benchmark and large-scale regression datasets. Higher test MLL (i.e., to the right) is better. See Appendix C.3 for a discussion on the performance gap between SGPs.

5.2 Supervised Learning: Regression and Classification

For our experiments in the regression tasks, the depth L of the DGP models is varied from 1 to 5 with 128 inducing inputs per layer. The dimension of each hidden DGP layer is set to be (i) the same as the input dimension for the UCI benchmark regression and Airline datasets, (ii) 16 for the YearMSD dataset, and (iii) 98 for the classification tasks. Additional details and results for our experiments (including those for IPVI with and without parameter tying) are found in Appendix C.3.
UCI Benchmark Regression. Our experiments are first conducted on 7 UCI benchmark regression datasets. We have performed a random 0.9/0.1 train/test split.
Large-Scale Regression. 
We then evaluate the performance of IPVI on two real-world large-\nscale regression datasets: (i) YearMSD dataset with a large input dimension D = 90 and data size\nN \u2248 500000, and (ii) Airline dataset with input dimension D = 8 and a large data size N \u2248 2 million.\nFor YearMSD dataset, we use the \ufb01rst 463715 examples as training data and the last 51630 examples\nas test data7. For Airline dataset, we set the last 100000 examples as test data.\nIn the above regression tasks, the performance metric is the MLL of the test data (or test MLL). Fig. 4\nshows results of the test MLL and standard deviation over 10 runs. It can be observed that IPVI\ngenerally outperforms SGHMC and DSVI and the ranking summary shows that our IPVI framework\nfor a 2-layer DGP model (IPVI DGP 2) performs the best on average across all regression tasks. For\nlarge-scale regression tasks, the performance of IPVI consistently increases with a greater depth.\nEven for a small dataset, the performance of IPVI improves up to a certain depth.\nTime Ef\ufb01ciency. Table 1 and Fig. 5 show the better time ef\ufb01ciency of IPVI over the state-of-the-art\nSGHMC for a 4-layer DGP model that is trained using the Airline dataset. The learning rates are\n0.005 and 0.02 for IPVI and SGHMC (default setting adopted from [18]), respectively. 
Due to parallel sampling (Section 3) and a parameter-tying architecture (Section 4), our IPVI framework enables posterior samples to be generated 500 times faster. Although IPVI has more parameters than SGHMC, it runs 9 times faster during training due to its efficiency in sample generation.

7This avoids the 'producer' effect by ensuring that no song from an artist appears in both training & test data.

7

[Figure 4 panels: Power (D=4, N=9568), Concrete (D=8, N=1030), Boston (D=13, N=506), Kin8nm (D=8, N=8192), Wine Red (D=11, N=1599), Energy (D=8, N=768), Protein (D=9, N=45730), Airline (D=8, N=2055733), YearMSD (D=90, N=515345), and a ranking summary.]

Table 1: Time incurred by a 4-layer DGP model for the Airline dataset.

                                     IPVI        SGHMC
Average training time (per iter.)    0.35 sec.   3.18 sec.
U generation (100 samples)           0.28 sec.   143.7 sec.

Figure 5: Graph of test MLL vs. total incurred time to train a 4-layer DGP model for the Airline dataset.

Table 2: Mean test accuracy (%) achieved by IPVI, SGHMC, and DSVI for 3 classification datasets.

          MNIST          MNIST (M = 800)   Fashion-MNIST    CIFAR-10
          SGP    DGP 4   SGP    DGP 4      SGP    DGP 4     SGP    DGP 4
DSVI      97.32  97.41   97.92  98.05      87.99  86.98     47.15  51.79
SGHMC     96.41  97.55   97.07  97.91      85.84  87.08     47.32  52.81
IPVI      97.02  97.80   97.85  98.23      87.29  88.90     48.07  53.27

Classification. We evaluate the performance of IPVI in three classification tasks using the real-world MNIST, fashion-MNIST, and CIFAR-10 datasets. 
Both MNIST and fashion-MNIST datasets are grey-scale images of 28 × 28 pixels. The CIFAR-10 dataset consists of colored images of 32 × 32 pixels. We utilize a 4-layer DGP model with 100 inducing inputs per layer and a robust-max multiclass likelihood [21]; for the MNIST dataset, we also consider utilizing a 4-layer DGP model with 800 inducing inputs per layer to assess if its performance improves with more inducing inputs. Table 2 reports the mean test accuracy over 10 runs, which shows that our IPVI framework for a 4-layer DGP model performs the best in all three datasets. Additional results for IPVI with and without parameter tying are found in Appendix C.3.

5.3 Parameter-Tying vs. No Parameter Tying

Table 3 reports the train/test MLL achieved by IPVI with and without parameter tying for 2 small datasets: Boston (N = 506) and Energy (N = 768). For the Boston dataset, it can be observed that no tying consistently yields higher train MLL and lower test MLL, hence indicating overfitting. This is also observed for the Energy dataset when the number of layers exceeds 2. For both datasets, as the number of layers (hence number of parameters) increases, overfitting becomes more severe for no tying. In contrast, parameter tying alleviates the overfitting considerably.

Table 3: Train/test MLL achieved by IPVI with and without parameter tying over 10 runs.

Boston (N = 506)
DGP Layers   1            2            3            4            5
No Tying     -1.86/-2.21  -1.76/-2.37  -1.64/-2.48  -1.52/-2.51  -1.51/-2.57
Tying        -1.91/-2.09  -1.79/-2.08  -1.77/-2.13  -1.84/-2.09  -1.83/-2.10

Energy (N = 768)
DGP Layers   1            2            3            4            5
No Tying     -0.12/-0.44   0.03/-0.31   0.18/-0.34   0.20/-0.47   0.21/-0.58
Tying        -0.16/-0.32  -0.11/-0.34  -0.02/-0.23   0.10/-0.01   0.17/ 0.13

5.4 Unsupervised Learning: FreyFace Reconstruction

A DGP can naturally be generalized to perform unsupervised learning. 
The representation of a dataset in a low-dimensional manifold can be learned in an unsupervised manner by the GP latent variable model (GPLVM) [33] where only the observations Y ≜ {y_n}_{n=1}^N are given and the hidden representation X is unobserved and treated as latent variables. The objective is to infer the posterior p(X|Y). The GPLVM is a single-layer GP that casts X as an unknown distribution and can naturally be extended to a DGP. So, we construct a 2-layer DGP (X → F1 → F2 → Y) and use the generator samples to represent p(X|Y).

Figure 6: Unsupervised learning with the FreyFace dataset. (a) Latent representation interpolation and the corresponding reconstruction. (b) True posterior p(x⋆|y⋆_O) given the partial observation y⋆_O (left), variational posterior q(x⋆) learned by IPVI (middle), and Gaussian approximation (right). The PDF for p(x⋆|y⋆_O) is calculated using Bayes' rule where the marginal likelihood is computed using Monte Carlo integration. (c) The partial observation (with the ground truth reflected in the dark region) and two reconstructed samples from q(x⋆).

We consider the FreyFace dataset [47] taken from a video sequence that consists of 1965 images with a size of 28 × 20. We select the first 1000 images to train our DGP. To ease visualization, the dimension of the latent variables X is chosen to be 2. Additional details for our experiments are found in Appendix C.2. Fig. 6a shows the reconstruction of faces across the latent space. Interestingly, the first dimension of the latent variables X determines the expression from happy to calm while the second dimension controls the view angle of the face.

We then explore the capability of IPVI in reconstructing partially observed test data. Fig.
6b illustrates\nthat given only the upper half of the face, the real face may exhibit a multi-modal property, as re\ufb02ected\nin the latent space; intuitively, one cannot always tell whether a person is happy or sad by looking at\nthe upper half of the face. Our variational posterior accurately captures the multi-modal posterior\nbelief whereas the Gaussian approximation can only recover one mode (mode A) under this test\nscenario. So, IPVI can correctly recover two types of expressions: calm (mode A) and happy (mode\nB). We did not empirically compare with SGHMC here because it is not obvious to us whether their\nsampler setting can be carried over to this unsupervised learning task.\n\n6 Conclusion\n\nThis paper describes a novel IPVI framework for DGPs that can ideally recover an unbiased posterior\nbelief of the inducing variables and still preserve the time ef\ufb01ciency of VI. To achieve this, we cast\nthe DGP inference problem as a two-player game and search for a Nash equilibrium (i.e., an unbiased\nposterior belief) of this game using best-response dynamics. We propose a novel parameter-tying\narchitecture of the generator and discriminator in our IPVI framework for DGPs to alleviate over\ufb01tting\nand speed up training and prediction. Empirical evaluation shows that IPVI outperforms the state-of-\nthe-art approximation methods for DGPs in regression and classi\ufb01cation tasks and accurately learns\ncomplex multi-modal posterior beliefs in our synthetic experiment and an unsupervised learning\ntask. For future work, we plan to use our IPVI framework for DGPs to accurately represent the\nbelief of the unknown target function in active learning [4, 28, 35, 37\u201339, 44, 60] and Bayesian\noptimization [11, 13, 26, 34, 59, 61] when the available budget of function evaluations is moderately\nlarge. 
We also plan to develop distributed/decentralized variants [5–8, 23, 25, 27, 40, 43] of IPVI.

Acknowledgements. This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program, Singapore-MIT Alliance for Research and Technology (SMART) Future Urban Mobility (FM) IRG, National Research Foundation Singapore under its AI Singapore Programme Award No. AISG-GC-2019-002, and the Singapore Ministry of Education Academic Research Fund Tier 2, MOE2016-T2-2-156.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In Proc. OSDI, pages 265–283, 2016.

[2] B. Awerbuch, Y. Azar, A. Epstein, V. S. Mirrokni, and A. Skopalik. Fast convergence to nearly optimal solutions in potential games. In Proc. ACM EC, pages 264–273, 2008.

[3] T. Bui, D. Hernández-Lobato, J. Hernández-Lobato, Y. Li, and R. Turner. Deep Gaussian processes for regression using approximate expectation propagation. In Proc. ICML, pages 1472–1481, 2016.

[4] N. Cao, K. H. Low, and J. M. Dolan. Multi-robot informative path planning for active sensing of environmental phenomena: A tale of two algorithms. In Proc. AAMAS, pages 7–14, 2013.

[5] J. Chen, N. Cao, K. H. Low, R. Ouyang, C. K.-Y. Tan, and P. Jaillet. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pages 152–161, 2013.

[6] J. Chen, K. H. Low, P. Jaillet, and Y. Yao.
Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand systems. IEEE Transactions on Automation Science and Engineering, 12(3):901–921, 2015.

[7] J. Chen, K. H. Low, and C. K.-Y. Tan. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013.

[8] J. Chen, K. H. Low, C. K.-Y. Tan, A. Oran, P. Jaillet, J. M. Dolan, and G. S. Sukhatme. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pages 163–173, 2012.

[9] K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Random feature expansions for deep Gaussian processes. In Proc. ICML, pages 884–893, 2017.

[10] Z. Dai, A. Damianou, J. González, and N. Lawrence. Variational auto-encoded deep Gaussian processes. In Proc. ICLR, 2016.

[11] Z. Dai, H. Yu, K. H. Low, and P. Jaillet. Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pages 1496–1506, 2019.

[12] A. Damianou and N. Lawrence. Deep Gaussian processes. In Proc. AISTATS, pages 207–215, 2013.

[13] E. Daxberger and K. H. Low. Distributed batch Gaussian process optimization. In Proc. ICML, pages 951–960, 2017.

[14] M. P. Deisenroth and J. W. Ng. Distributed Gaussian processes. In Proc. ICML, pages 1481–1490, 2015.

[15] D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani. Avoiding pathologies in very deep networks. In Proc. AISTATS, pages 202–210, 2014.

[16] Y. Gal, M. van der Wilk, and C. E. Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Proc. NeurIPS, pages 3257–3265, 2014.

[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y.
Bengio. Generative adversarial nets. In Proc. NeurIPS, pages 2672–2680, 2014.

[18] M. Havasi, J. M. Hernández-Lobato, and J. J. Murillo-Fuentes. Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo. In Proc. NeurIPS, pages 7517–7527, 2018.

[19] J. Hensman, N. Fusi, and N. Lawrence. Gaussian processes for big data. In Proc. UAI, pages 282–290, 2013.

[20] J. Hensman and N. D. Lawrence. Nested variational compression in deep Gaussian processes. arXiv:1412.1370, 2014.

[21] D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont. Robust multi-class Gaussian process classification. In Proc. NeurIPS, pages 280–288, 2011.

[22] Q. M. Hoang, T. N. Hoang, and K. H. Low. A generalized stochastic variational Bayesian hyperparameter learning framework for sparse spectrum Gaussian process regression. In Proc. AAAI, pages 2007–2014, 2017.

[23] Q. M. Hoang, T. N. Hoang, K. H. Low, and C. Kingsford. Collective model fusion for multiple black-box experts. In Proc. ICML, pages 2742–2750, 2019.

[24] T. N. Hoang, Q. M. Hoang, and K. H. Low. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pages 569–578, 2015.

[25] T. N. Hoang, Q. M. Hoang, and K. H. Low. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pages 382–391, 2016.

[26] T. N. Hoang, Q. M. Hoang, and K. H. Low. Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, pages 3231–3238, 2018.

[27] T. N. Hoang, Q. M. Hoang, K. H. Low, and J. P. How. Collective online learning of Gaussian processes in massive multi-agent systems. In Proc. AAAI, 2019.

[28] T. N. Hoang, K. H. Low, P. Jaillet, and M. Kankanhalli.
Nonmyopic ε-Bayes-optimal active learning of Gaussian processes. In Proc. ICML, pages 739–747, 2014.

[29] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[30] F. Huszár. Variational inference using implicit distributions. arXiv:1702.08235, 2017.

[31] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Proc. ICLR, 2018.

[32] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proc. ICLR, 2013.

[33] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Proc. NeurIPS, pages 329–336, 2004.

[34] C. K. Ling, K. H. Low, and P. Jaillet. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, pages 1860–1866, 2016.

[35] K. H. Low, J. Chen, J. M. Dolan, S. Chien, and D. R. Thompson. Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In Proc. AAMAS, pages 105–112, 2012.

[36] K. H. Low, J. Chen, T. N. Hoang, N. Xu, and P. Jaillet. Recent advances in scaling up Gaussian process predictive models for large spatiotemporal data. In S. Ravela and A. Sandu, editors, Proc. Dynamic Data-driven Environmental Systems Science Conference (DyDESS'14). LNCS 8964, Springer, 2015.

[37] K. H. Low, J. M. Dolan, and P. Khosla. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, pages 23–30, 2008.

[38] K. H. Low, J. M. Dolan, and P. Khosla. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pages 233–240, 2009.

[39] K. H. Low, J. M. Dolan, and P. Khosla.
Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, pages 753–760, 2011.

[40] K. H. Low, J. Yu, J. Chen, and P. Jaillet. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pages 2821–2827, 2015.

[41] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. JMLR, 18:1–6, 2017.

[42] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proc. ICML, pages 2391–2400, 2017.

[43] R. Ouyang and K. H. Low. Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception. In Proc. AAAI, pages 3876–3883, 2018.

[44] R. Ouyang, K. H. Low, J. Chen, and P. Jaillet. Multi-robot active sensing of non-stationary Gaussian process-based environmental phenomena. In Proc. AAMAS, pages 573–580, 2014.

[45] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.

[46] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.

[47] S. T. Roweis, L. K. Saul, and G. E. Hinton. Global coordination of local linear models. In Proc. NeurIPS, pages 889–896, 2002.

[48] H. Salimbeni and M. Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Proc. NeurIPS, pages 4588–4599, 2017.

[49] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Proc. NeurIPS, pages 4134–4142, 2016.

[50] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proc.
AISTATS, pages 567–574, 2009.

[51] M. K. Titsias. Variational model selection for sparse Gaussian process regression. Technical report, School of Computer Science, University of Manchester, 2009.

[52] M. K. Titsias and F. J. R. Ruiz. Unbiased implicit variational inference. In Proc. AISTATS, pages 167–176, 2019.

[53] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.

[54] K.-C. Wang, P. Vicol, J. Lucas, L. Gu, R. Grosse, and R. Zemel. Adversarial distillation of Bayesian neural network posteriors. In Proc. ICML, pages 5177–5186, 2018.

[55] N. Xu, K. H. Low, J. Chen, K. K. Lim, and E. B. Özgül. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, pages 2585–2592, 2014.

[56] M. Yin and M. Zhou. Semi-implicit variational inference. In Proc. ICML, pages 5660–5669, 2018.

[57] H. Yu, T. N. Hoang, K. H. Low, and P. Jaillet. Stochastic variational inference for Bayesian sparse Gaussian process regression. In Proc. IJCNN, 2019.

[58] R. Zhang, C. Li, J. Zhang, C. Chen, and A. G. Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. arXiv:1902.03932, 2019.

[59] Y. Zhang, Z. Dai, and K. H. Low. Bayesian optimization with binary auxiliary information. In Proc. UAI, 2019.

[60] Y. Zhang, T. N. Hoang, K. H. Low, and M. Kankanhalli. Near-optimal active learning of multi-output Gaussian processes. In Proc. AAAI, pages 2351–2357, 2016.

[61] Y. Zhang, T. N. Hoang, K. H. Low, and M. Kankanhalli. Information-based multi-fidelity Bayesian optimization. In Proc.
NIPS Workshop on Bayesian Optimization, 2017.