{"title": "Bayesian Warped Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1619, "page_last": 1627, "abstract": "Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. In order to learn its parameters, maximum likelihood was used. In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: Low data regime, data with censored values, classification, etc. We demonstrate the superior performance of Bayesian warped GPs on several real data sets.", "full_text": "Bayesian Warped Gaussian Processes\n\nMiguel L\u00b4azaro-Gredilla\n\nDept. Signal Processing & Communications\n\nUniversidad Carlos III de Madrid - Spain\n\nmiguel@tsc.uc3m.es\n\nAbstract\n\nWarped Gaussian processes (WGP) [1] model output observations in regression\ntasks as a parametric nonlinear transformation of a Gaussian process (GP). The\nuse of this nonlinear transformation, which is included as part of the probabilistic\nmodel, was shown to enhance performance by providing a better prior model on\nseveral data sets. In order to learn its parameters, maximum likelihood was used.\nIn this work we show that it is possible to use a non-parametric nonlinear trans-\nformation in WGP and variationally integrate it out. The resulting Bayesian WGP\nis then able to work in scenarios in which the maximum likelihood WGP failed:\nLow data regime, data with censored values, classi\ufb01cation, etc. 
We demonstrate the superior performance of Bayesian warped GPs on several real data sets.

1 Introduction

In a Bayesian setting, the Gaussian process (GP) is commonly used to define a prior probability distribution over functions. This leads to a simple and elegant probabilistic framework that allows us to solve, among others, regression and classification tasks, achieving state-of-the-art performance [2, 3]. For a thorough treatment of GPs, the reader is referred to [4].

In the regression setting, output data are often modelled directly as observations from a GP. However, it is shown in [1] that for some data sets, better models can be built if the observed outputs are instead regarded as a nonlinear distortion (the so-called warping) of a GP. For a warped GP (WGP), the warping function can take any parametric form, and in [1] the sum of a linear function and several tanh functions is used. The parameters defining the transformation are then learned using maximum likelihood. WGPs have the advantage of having a closed-form expression for the evidence and have been applied in a number of works [5, 6], but they also have several shortcomings: Maximum likelihood learning might result in overfitting if a warping function with too many parameters is used (or if too few data are available); it does not model additional output noise after the warping; and it cannot model “flat” warping functions, for reasons explained below, and as a consequence runs into problems when observations are clustered (many output data take the same value). In this work we set out to show that it is possible to place another GP prior on the warping function and variationally integrate it out.
By doing so, all of the aforementioned problems disappear and we can enjoy the benefits of WGPs in a wider selection of scenarios.

The remainder of this work is organised as follows: In Section 2 we introduce the Bayesian WGP model, which is analytically intractable. In Section 3, a variational lower bound on the exact evidence of the model is derived, which allows for approximate inference and hyperparameter learning. We show the advantages of integrating out the warping function in Section 4, where we compare the performance of the maximum likelihood and the Bayesian versions of warped GPs. Finally, we wrap up with some concluding remarks in Section 5.

2 The Bayesian warped Gaussian process model

Given a set of input values {x_i ∈ R^D} and their associated targets {y_i ∈ R}, i = 1…n, we define the Bayesian warped Gaussian process (BWGP) model as

    y_i = g(f(x_i)) + ε_i        (1a)

where f(x) is a (possibly noisy) latent function with D-dimensional inputs, g(f) is an arbitrary warping function with scalar inputs and ε is a Gaussian noise term. Proceeding in a Bayesian fashion, we place priors on g, f, and ε_i. We use Gaussian process and normal priors

    f(x) ∼ GP(μ_0, k(x, x′)),    g(f) ∼ GP(f, c(f, f′)),    ε_i ∼ N(0, σ²).        (1b)

Notice that by setting the prior mean on g(f) to f, we assume that the warping is “by default” the identity. For f, any valid covariance function k(x, x′) can be used, whereas for the warping function g we use a squared exponential: c(f, f′) = σ_g² exp(−(f − f′)²/(2ℓ²)).
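To make the generative model (1) concrete, the following minimal sketch (an editorial illustration, not part of the paper; all variable names and hyperparameter settings are our own choices) draws one sample from the BWGP prior on a one-dimensional grid: first f from its GP prior, then g conditioned on f, then additive Gaussian noise:

```python
import numpy as np

def se_kernel(a, b, variance, lengthscale):
    """Squared-exponential covariance: variance * exp(-(a - b)^2 / (2 * lengthscale^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-d2 / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)          # 1-D inputs for illustration (D = 1)

# f(x) ~ GP(mu0, k): latent function values at the inputs (jitter for stability)
K = se_kernel(x, x, variance=1.0, lengthscale=1.0) + 1e-6 * np.eye(x.size)
mu0 = 0.0
f = mu0 + np.linalg.cholesky(K) @ rng.standard_normal(x.size)

# g(f) ~ GP(f, c): warping drawn around the identity, SE covariance in latent space
C = se_kernel(f, f, variance=0.5, lengthscale=0.5) + 1e-6 * np.eye(f.size)
g = f + np.linalg.cholesky(C) @ rng.standard_normal(f.size)

# y = g(f(x)) + eps: observed outputs
sigma = 0.1
y = g + sigma * rng.standard_normal(f.size)
```

Note that the warping covariance c is evaluated between latent function values f, not between inputs x, so the composed draw g(f(x)) is not itself a GP sample.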
These hyperparameters, as well as those included in k(x, x′), can be collected in θ ≡ {θ_k, σ_g, ℓ, σ, μ_0}.

It might seem that, since f(x) is already an arbitrary nonlinear function, further distorting its output through g(f) is an unnecessary additional complication. However, even though g(f(x)) can model arbitrary functions just as f(x) can, the implied prior is very different, since the composition of two GPs, g(f(x)), is no longer a GP. This is the same idea as with copulas, but here the warping function g(f) is treated in non-parametric form.

2.1 Relationship with maximum likelihood warped Gaussian processes

Though the idea of distorting a standard GP is common to WGP and BWGP, there are several relevant differences worth clarifying:

In [1], noise is present only in the latent function f(x) and the observed data correspond exactly to the warping of f(x). BWGP has an additional noise term ε that can account for extra noise in the observations after warping. This term can be neglected by setting σ² = 0.

BWGP places a prior on the warping function, instead of using a parametric definition, which allows for maximum flexibility while avoiding overfitting. In contrast, by choosing the number of tanh functions in its parametric warping function, WGP sets a trade-off between the two.

Finally, the definition of the warping function is reversed between BWGP and WGP. If no noise is present, our warping function y = g(f) maps latent space f to output space y. In contrast, in [1] the inverse mapping f = w(y) is defined, for reasons of analytical tractability. Because of this, the warping function in [1] is restricted to be monotonic, so that it is possible to unambiguously identify its inverse y = w⁻¹(f) = g(f) and thus define a valid probability distribution in output space.
Since we already work with the direct warping function g(f), we do not need to impose any constraint on it and can therefore use a GP prior. Also, as discussed in [1], WGPs cannot deal properly with models that involve a “flat” region (i.e., g′(f) = 0) in the warping function (such as ordinal regression or classification), since the inverse w(y) = g⁻¹(y) is not well defined. These flat regions result in probability masses in output space. In those cases, the probability density of data under the WGP model (the evidence) will be infinite, so that it cannot be used for model selection, and numerical computation becomes unstable. None of these problems arises in BWGP, which can handle both continuous and discrete observations and model warping functions with flat regions.

2.2 Relationship with other Gaussian process models

For a given warping function g(f), BWGP can be seen as a standard GP model with likelihood p(y_i|f(x_i)) = N(y_i | g(f(x_i)), σ²). Different choices for g(f) result in different GP models:

• GP regression [7]: Corresponds to setting g(f) = f (the mean in our prior).
• GP classification [3]: Corresponds to setting g(f) = sign(f) with y_i ∈ {−1, +1} and σ² = 0. Using a noisy latent function f(x) as prior and a step function as likelihood is equivalent to using a noiseless latent function as prior and a normal-cdf sigmoid function as likelihood [4], so this model corresponds exactly to GP probit classification.
• Ordinal (noisy) regression [8]: Corresponds to setting g(f) = Σ_{k=1}^{K} H(f − b_k) and optionally setting σ² = 0.
Here H(f) is the Heaviside step function and the b_k are parameters defining the widths and locations of the K bins in latent space.

• Maximum likelihood WGP [1]: Corresponds to setting g(f) = w⁻¹(f) and σ² = 0.

Because g(f) is integrated out, all of the above models, and possibly many others, can be learned using BWGP. We will see examples of problems requiring other likelihoods in Section 4. Thus, to some extent, BWGP can be regarded as a likelihood-learning tool.

3 Variational inference for BWGP

Analytical inference in the BWGP model (1) is intractable. Instead of resorting to expensive Monte Carlo methods, we will develop an efficient variational approximation of computational cost comparable to that of WGP. We follow ideas discussed in [9] in order to gain tractability.

3.1 Augmented model

First, let us rewrite (1) instantiated only at the available observations y = [y_1 … y_n]^⊤. We omit conditioning on the inputs {x_i}, i = 1…n, and the hyperparameters θ. We have

    p(y|g) = N(y | g, σ²I),    p(g|f) = N(g | f, C_ff),    p(f) = N(f | μ_0, K),        (2)

where f = [f_1 … f_n]^⊤ is the latent function evaluated at the training inputs {x_1 … x_n} and g = [g_1 … g_n]^⊤ is the warping function evaluated at f. We use K to refer to the n × n covariance matrix of the latent function, with entries [K]_ij = k(x_i, x_j), whereas similarly [C_ff]_ij = c(f_i, f_j) is the n × n warping covariance matrix. In general, we use [C_ab]_ij = c(a_i, b_j).

Now we proceed as in sparse GPs [10] and augment this model with a set of m inducing variables u = [u_1 … u_m]^⊤ that correspond to evaluating the function u(v) = g(v) − v at some auxiliary values v_1 … v_m. We can expand p(g|f) by first conditioning on u to obtain p(g|u, f), and then including the prior p(u).
This yields the augmented model

    p(y|g) = N(y | g, σ²I),    p(g|u, f) = N(g | f + C_fv C_vv⁻¹ u, C_ff − C_fv C_vv⁻¹ C_fv^⊤),        (3a)
    p(u) = N(u | 0, C_vv),    p(f) = N(f | μ_0, K).        (3b)

Note that the original model (2) and the augmented model (3) are exactly identical, since we can marginalise u out from (3) to recover exactly (2). In other words, we introduced u in a consistent manner, so that ∫ p(g|u, f) p(u) du = p(g|f). The inclusion of the inducing variables does not change the model, independently of their number m or their locations v_1 … v_m.

The inducing variables u have a physical interpretation in this model. Expressing the warping function as g(v) = u(v) + v, the inducing variables correspond to evaluating the GP u(v) at locations v_1 … v_m, which live in latent space (just as f does). Observe that u provides a probabilistic description of the warping function. In particular, as m grows and the sampling in latent space becomes more and more dense¹, the covariance C_ff − C_fv C_vv⁻¹ C_fv^⊤ gets closer to zero² and p(g|u, f) becomes a Dirac delta, thus making the warping function deterministic given u: g(f) = f + [c(f, v_1) … c(f, v_m)] C_vv⁻¹ u.

3.2 Variational lower bound

The exact posterior of the BWGP model (3) is analytically intractable. We can proceed by selecting, within a given family of distributions, the approximate posterior q(g, u, f) that minimises the Kullback-Leibler (KL) divergence to the true posterior p(g, u, f|y).
We can write

    log p(y) ≥ log p(y) − KL(q(g, u, f) || p(g, u, f|y)) = ∫ q(g, u, f) log [ p(y, g, u, f) / q(g, u, f) ] dg df du = F,

where F is a variational lower bound on the evidence log p(y). Since log p(y) is constant for any choice of q, maximising F wrt q yields the best approximation, in the mentioned KL sense, within the considered family of distributions. We should choose a family that can model the posterior as well as possible while keeping the computation of F tractable. If no constraints on q are imposed, maximisation retrieves the exact posterior.

¹ We can make m, which is the number of inducing inputs and associated inducing variables, as big as we desire (and thus make the sampling arbitrarily dense), independently of the number of available samples n.
² Note that C_fv C_vv⁻¹ C_fv^⊤ is a Nyström approximation to C_ff, whose quality grows with m.

We expand q(g, u, f) = q(g|u, f) q(u|f) q(f) and constrain it as follows: q(f) = N(f | μ, Σ), q(u|f) = q(u), q(g|u, f) = p(g|u, f). We argue that these constraints should still allow for a good approximation: The exact posterior over f for any monotonic warping function is Gaussian (see [1]), so it is reasonable to set q(f) to be Gaussian; the GPs u(v) and f(x) are independent a priori and encode different parts of the model, so it is reasonable to approximate them as independent a posteriori, q(u|f) = q(u); and finally, given a dense sampling of the latent space (which is feasible, since it is one-dimensional), p(g|u, f) is virtually a Dirac delta, so conditioning on the observations has no effect and we can set q(g|u, f) = p(g|u, f).
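The KL(q(f)‖p(f)) term between the Gaussian q(f) = N(μ, Σ) and the Gaussian prior on f has the standard closed form ½[trace(K⁻¹Σ) + (μ − μ_0)^⊤K⁻¹(μ − μ_0) − n + log|K| − log|Σ|]. A minimal numpy sketch of this textbook formula (an editorial illustration; the function and variable names are our own):

```python
import numpy as np

def gauss_kl(mu_q, Sigma_q, mu_p, K_p):
    """KL( N(mu_q, Sigma_q) || N(mu_p, K_p) ) between n-dimensional Gaussians:
    0.5 * [ trace(K_p^{-1} Sigma_q) + (mu_q - mu_p)^T K_p^{-1} (mu_q - mu_p)
            - n + log|K_p| - log|Sigma_q| ]."""
    n = mu_q.size
    L = np.linalg.cholesky(K_p)                 # K_p = L L^T
    alpha = np.linalg.solve(L, mu_q - mu_p)     # L^{-1} (mu_q - mu_p)
    tr = np.trace(np.linalg.solve(K_p, Sigma_q))
    logdet_p = 2.0 * np.sum(np.log(np.diag(L)))
    _, logdet_q = np.linalg.slogdet(Sigma_q)
    return 0.5 * (tr + alpha @ alpha - n + logdet_p - logdet_q)
```

The divergence is zero exactly when q(f) matches the prior, and it penalises posteriors that drift far from N(μ_0, K).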
Using the constrained expansion for q we get

    F(q(u), μ, Σ) = ∫ q(u) q(f) [ ∫ p(g|u, f) log p(y|g) dg + log ( p(u) / q(u) ) ] df du − KL(q(f) || p(f)).

The inner integral yields

    ∫ p(g|u, f) log p(y|g) dg = −(n/2) log(2πσ²) − (1/(2σ²)) { trace(C_ff − C_fv C_vv⁻¹ C_fv^⊤) + ‖y − f‖² − 2 y^⊤ C_fv C_vv⁻¹ u + u^⊤ C_vv⁻¹ C_fv^⊤ C_fv C_vv⁻¹ u + 2 u^⊤ C_vv⁻¹ C_fv^⊤ f },

which can be averaged analytically over q(f) = N(f | μ, Σ). To this end, we define ψ_0 = ⟨trace(C_ff)⟩_q(f), Ψ_1 = ⟨C_fv⟩_q(f), Ψ_2 = ⟨C_fv^⊤ C_fv⟩_q(f), and ψ_3 = ⟨C_fv^⊤ f⟩_q(f), which are

    ψ_0 = n σ_g²,
    [Ψ_1]_ij = σ_g² ℓ exp( −([μ]_i − v_j)² / (2([Σ]_ii + ℓ²)) ) / √([Σ]_ii + ℓ²),
    [Ψ_2]_jk = Σ_{i=1}^{n} σ_g⁴ ℓ exp( −(v_j − v_k)²/(4ℓ²) − ([μ]_i − (v_j + v_k)/2)² / (2[Σ]_ii + ℓ²) ) / √(2[Σ]_ii + ℓ²),
    [ψ_3]_j = Σ_{i=1}^{n} σ_g² ℓ exp( −([μ]_i − v_j)² / (2([Σ]_ii + ℓ²)) ) ([μ]_i ℓ² + [Σ]_ii v_j) / √(([Σ]_ii + ℓ²)³).

After averaging over q(f), most of the terms do not depend on u and can be taken out of the integral. The remaining terms, which depend on u, can be arranged as follows:

    ∫ q(u) log [ p(u) exp( −(1/(2σ²)) u^⊤ C_vv⁻¹ Ψ_2 C_vv⁻¹ u + (1/σ²)(y^⊤ Ψ_1 − ψ_3^⊤) C_vv⁻¹ u ) / q(u) ] du.        (4)

Note that we have not specified any functional form for q(u), so any distribution over u is valid. In particular, we want to choose q(u) so as to maximise (4), because that is the choice that maximises F(q(u), μ, Σ). Inspecting (4), we notice that it has the form of a Jensen's inequality lower bound.
The maximum wrt q(u) can then be obtained by reversing Jensen's inequality:

    log ∫ p(u) exp( −(1/(2σ²)) u^⊤ C_vv⁻¹ Ψ_2 C_vv⁻¹ u + (1/σ²)(y^⊤ Ψ_1 − ψ_3^⊤) C_vv⁻¹ u ) du
        = (1/(2σ²)) (y^⊤ Ψ_1 − ψ_3^⊤)(Ψ_2 + σ² C_vv)⁻¹ (Ψ_1^⊤ y − ψ_3) − (1/2) log ( |Ψ_2 + σ² C_vv| / |C_vv| ) + (n/2) log σ²,

which corresponds to selecting³ q*(u) = N(u | C_vv β, σ² C_vv (Ψ_2 + σ² C_vv)⁻¹ C_vv) with β = (Ψ_2 + σ² C_vv)⁻¹ (Ψ_1^⊤ y − ψ_3). Replacing one of the variational distributions within the bound by its optimal value is sometimes referred to as using a “marginalised variational bound” [11]. Grouping all terms together, we finally obtain:

    F_BWGP(μ, Σ) = −(1/(2σ²)) ( ‖y − μ‖² + trace(Σ) + ψ_0 − trace(Ψ_2 C_vv⁻¹) ) − (1/2) log ( |Ψ_2 + σ² C_vv| / |C_vv| ) + (1/(2σ²)) (y^⊤ Ψ_1 − ψ_3^⊤)(Ψ_2 + σ² C_vv)⁻¹ (Ψ_1^⊤ y − ψ_3) − (n/2) log 2π − KL(N(μ, Σ) || N(μ_0, K)).

³ Using variational arguments, q*(u) ∝ p(u) exp( −(1/(2σ²)) u^⊤ C_vv⁻¹ Ψ_2 C_vv⁻¹ u + (1/σ²)(y^⊤ Ψ_1 − ψ_3^⊤) C_vv⁻¹ u ).

This bound depends on μ and Σ, i.e., n + n(n+1)/2 variational parameters which must be optimised. Even for moderate sizes of n, this can be inconvenient. Following [12, 13], we can reduce the number of free parameters by considering the conditions that must be met at any local maximum. By imposing ∂F(μ, Σ)/∂Σ = 0, we know that the posterior covariance can be expressed as Σ = (K⁻¹ + Λ)⁻¹ for some diagonal matrix Λ.
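The Ψ statistics entering the bound depend only on μ and the diagonal of Σ and can be transcribed directly into numpy. The sketch below is our own editorial illustration of those Gaussian expectations (names are hypothetical; s_i stands for [Σ]_ii, and the linear factor in ψ_3 follows from completing the square in the Gaussian expectation ⟨c(f_i, v_j) f_i⟩):

```python
import numpy as np

def psi_stats(mu, s, v, sg2, ell2):
    """Expectations of the warping covariance under q(f) = N(mu, diag(s)):
    psi0 = <trace(Cff)>, Psi1 = <Cfv>, Psi2 = <Cfv^T Cfv>, psi3 = <Cfv^T f>.
    mu, s: length-n means/variances; v: length-m inducing inputs;
    sg2, ell2: warping-kernel variance sigma_g^2 and squared lengthscale l^2."""
    n = mu.size
    ell = np.sqrt(ell2)
    psi0 = n * sg2
    d = mu[:, None] - v[None, :]                       # (n, m)
    denom = s[:, None] + ell2                          # (n, m)
    # [Psi1]_ij = sg2 * l * exp(-(mu_i - v_j)^2 / (2 (s_i + l^2))) / sqrt(s_i + l^2)
    Psi1 = sg2 * ell * np.exp(-d**2 / (2 * denom)) / np.sqrt(denom)
    # [Psi2]_jk = sum_i sg2^2 * l * exp(-(v_j - v_k)^2 / (4 l^2)
    #             - (mu_i - (v_j + v_k)/2)^2 / (2 s_i + l^2)) / sqrt(2 s_i + l^2)
    vbar = 0.5 * (v[:, None] + v[None, :])             # (m, m)
    dv2 = (v[:, None] - v[None, :])**2                 # (m, m)
    e_i = (mu[:, None, None] - vbar[None, :, :])**2 / (2 * s[:, None, None] + ell2)
    Psi2 = (sg2**2 * ell * np.exp(-dv2[None] / (4 * ell2) - e_i)
            / np.sqrt(2 * s[:, None, None] + ell2)).sum(axis=0)
    # [psi3]_j = sum_i sg2 * l * exp(-(mu_i - v_j)^2 / (2 (s_i + l^2)))
    #            * (mu_i l^2 + s_i v_j) / (s_i + l^2)^{3/2}
    psi3 = (sg2 * ell * np.exp(-d**2 / (2 * denom))
            * (mu[:, None] * ell2 + s[:, None] * v[None, :]) / denom**1.5).sum(axis=0)
    return psi0, Psi1, Psi2, psi3
```

A useful sanity check is the limit s → 0, in which Ψ_1 collapses to C_fv evaluated at μ, Ψ_2 to C_fv^⊤C_fv, and ψ_3 to C_fv^⊤μ.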
With this definition, the bound F(μ, Λ) depends on only 2n free variational parameters and can be computed in O(n³) time and O(n²) space, just as WGP.

3.3 Model selection

The gradients of the variational bound F_BWGP(μ, Λ, θ) (now explicitly including its dependence on the hyperparameters) can be computed analytically, so it is possible to jointly optimise it wrt both the 2n free variational parameters and the hyperparameters θ, in order to simultaneously perform model selection (by choosing the hyperparameters) and obtain an accurate posterior (by choosing the free variational parameters). The hyperparameters are the same as for a WGP that uses a single tanh function, so no overfitting is expected, while still enjoying a completely flexible warping function.

3.4 Approximate predictive density

In order to use the proposed approximate posterior to make predictions for a new test output y* given input x*, we need to compute q(y*|y) = ∫ p(y*|g*) p(g*|f*, u) q(u) p(f*|f) q(f) dg* df* du df. Integration wrt all variables can be computed analytically except for f*, resulting in

    q(y*|y) = ∫ q(y*|f*) q(f*|y) df*,

with q(y*|f*) = N(y* | f* + c*^⊤ β, σ² + c** − c*^⊤ (C_vv + σ² C_vv Ψ_2⁻¹ C_vv)⁻¹ c*) and q(f*|y) = N(f* | μ*, σ*²), where μ* = μ_0 + k*^⊤ K⁻¹(μ − μ_0 1), σ*² = k** − k*^⊤ (K + Λ⁻¹)⁻¹ k*, k* = [k(x*, x_1) … k(x*, x_n)]^⊤, k** = k(x*, x*), c* = [c(f*, v_1) … c(f*, v_m)]^⊤ (so that it conforms with the m-dimensional β), c** = c(f*, f*) = σ_g², and 1 is an appropriately sized vector of ones.

This latter one-dimensional integral can be computed numerically if needed, using Gaussian quadrature techniques. However, the posterior mean and variance can be computed analytically. Indeed,

    E_q[y*|y] = μ* + Ψ_1* β,
    V_q[y*|y] = β^⊤(Ψ_2* − Ψ_1*^⊤ Ψ_1*)β + 2(ψ_3*^⊤ − μ*^⊤ Ψ_1*)β + σ*² + σ² + c** − trace( Ψ_2* (C_vv + σ² C_vv Ψ_2⁻¹ C_vv)⁻¹ ),

where the starred Ψ* quantities are defined as their non-starred counterparts, but using μ* and σ*² instead of μ and Σ in their computation. In spite of this, the approximate posterior is not Gaussian in general.

4 Experiments

We will now investigate the behaviour of BWGP on several real regression and classification data sets. In our experiments we compare its performance with that of the original implementation⁴ of the maximum likelihood WGP model from [1]. In order to show the effect of varying the complexity of the parametric warping function in WGP, we tested a 3-tanh model (the default, used in the experiments from [1]) and a 20-tanh model, denoted WGP3 and WGP20, respectively. We did our best to achieve the maximum accuracy in WGP, so in order to solve each data split, we optimised its hyperparameters 5 times from a random initialisation (the implementation's default method) and 5 more times using a standard GP to initialise the underlying GP (and randomly initialising the warping function).
Out of the 10 total runs, we used the one achieving the highest evidence. The BWGP model was initialised from a standard GP and run only once per data split. The standard ARD SE covariance function [4] plus noise was used for the underlying GP in all models. The two measures that we use to compare performance are MSE = (1/n*) Σ_{i=1}^{n*} (y*_i − E_q[y*_i|y])² and NLPD = −(1/n*) Σ_{i=1}^{n*} log q(y*_i|y). In both cases, a lower value indicates better performance.

⁴ Available from http://www.gatsby.ucl.ac.uk/~snelson/.

4.1 Toy 1D data

First we evaluate the model on a simple one-dimensional toy problem. In order to generate a nonlinearly distorted signal, we round a sine function to the nearest integer and add Gaussian noise with variance σ² = 2.5 × 10⁻³. The training set consists of 51 uniformly spaced samples between −π and π. We train a standard GP, WGP, and BWGP and then test them on 401 uniformly spaced samples in the same interval. Results are displayed in Fig. 1.

[Figure 1 here: left, posterior fit to the training samples for GP, WGP3 and BWGP; middle, p(y*|D) at x* = 0 and x* = 0.4; right, the inferred BWGP and WGP3 warping functions g(f).]

Figure 1: Left: Posterior mean for the proposed models.
A dashed envelope encloses 90% posterior mass for WGP, whereas shading is used to show 90% posterior mass for BWGP. Middle: The dotted line shows the true posterior at x = 0 and x = 0.4, which is much better modelled by BWGP. Right: Warping functions inferred by each model.

The warping functions look reasonable for both models. For WGP it is a deterministic function, the inverse of the strictly monotonic function w(y), so it can never achieve completely “flat” zones. Since WGP does not model output noise explicitly, these flat zones transfer and magnify output noise to latent space, with the consequent degradation in performance. Note the extra spread of its posterior mass in comparison with the actual training data, which is much better modelled by BWGP. The mean of WGP fails to follow the flat regions at zero, behaving as a sine function, just like the standard GP. The standard GP is also unable to handle this signal properly because of its non-stationary smoothness: Abrupt changes are followed by constant levels. BWGP is able to deal properly with noisy quantised signals and learns the implicit quantisation function.

4.2 Regression data sets

We now turn to the three real data sets originally used in [1] to assess WGP, and for which it is especially suited. These are: abalone [14] (4177 samples, 8 dimensions), creep [15, 16] (2066 samples, 30 dimensions), and ailerons [17] (7154 samples, 40 dimensions). As for the size of the training set, the typical choice is to use 1000, 800 and 1000 samples, respectively. For each problem, we generated 60 splits by randomly partitioning the data. Results are displayed in Table 1. The warping functions inferred by BWGP are displayed in Fig. 3(a)-(c) and are almost identical to those displayed in [1] for WGP.
The shading represents 99.99% posterior mass.

Table 1: MSE and NLPD figures for the compared methods on the original data sets of [1].

          |                  MSE                  |               NLPD
Model     | abalone    creep        ail (×10⁻⁸)   | abalone    creep      ailerons
GP        | 4.55±0.14  584.9±71.2   2.95±0.16     | 2.17±0.01  4.46±0.03  -7.30±0.01
BWGP      | 4.55±0.11  491.8±36.2   2.91±0.14     | 1.99±0.01  4.31±0.04  -7.38±0.02
MLWGP3    | 4.54±0.10  502.3±43.3   2.80±0.11     | 1.97±0.02  4.21±0.03  -7.44±0.01
MLWGP20   | 4.59±0.32  506.3±46.1   3.42±2.87     | 1.99±0.05  4.21±0.08  -7.45±0.08

In terms of NLPD, BWGP always outperforms the standard GP, but it is in turn outperformed by the maximum likelihood variants, which do not need to resort to any approximation to compute their posterior. In terms of MSE, BWGP always performs better than WGP20 on these data sets, but only performs better than WGP3 on the creep data set, which, on the other hand, is the one that seems to benefit the most from the use of a warping function. It seems that the additional flexibility of the warping function in WGP20 is penalising its ability to generalise properly.

Upon seeing these results, we can conclude that WGP3 is already a good enough solution when abundant training data are available and a simple warping function is required. This is reasonable: The additional number of hyperparameters is small (only 9) and inference can be performed analytically. We can also see in Fig. 3(a)-(c) that the posterior over the warping functions is highly peaked, so a maximum likelihood approach makes sense. However, performance might suffer when the warping function becomes even slightly complex, as in creep, or when the number of data available for training is very small (see the effect of the training set size in Fig. 2).
In those cases, BWGP is a safer option, since it will not overfit, independently of the amount of data, while still allowing for a highly flexible warping function.

[Figure 2 here: three panels (abalone, creep, ailerons) showing average MSE on 60 splits vs. number of training data, for GP, WGP3, BWGP and WGP20.]

Figure 2: Average MSE, as well as estimated ±1 std. deviation of the average, for 60 splits.

4.3 Censored regression data sets

We will now modify the previous data sets so that they become more challenging. We will consider that they have been censored, i.e., values that lie above or below some thresholds have been truncated. This is a realistic setting in the case of physical measurements (e.g., due to the limitations of measuring devices), but clusters of values lying at the end of the range can appear in other cases too. In our experiments, we truncated the upper and lower 20% of the previous data sets, while keeping the remaining 60% of the data untouched. Note that the methods have no information about the existing truncation or the thresholds used.

As discussed in [1], for this type of data, WGP tries to spread the samples in latent space by using a very sharp warping function, and this causes the model problems. Additionally, the computation of the NLPD becomes erroneous due to numerical problems, with some of the tanh functions becoming very close to sign functions.
None of these problems were experienced by BWGP, which still works significantly better than a standard GP on this type of problem; see Table 2. The corresponding warping functions are displayed in Figs. 3(e)-(g).

Table 2: MSE and NLPD figures for the compared methods on the censored data sets.

          |                  MSE                  |               NLPD
Model     | abalone    creep         ail (×10⁻⁸)  | abalone    creep      ailerons
GP        | 1.27±0.12  339.5±29.2    1.20±0.12    | 1.54±0.05  4.22±0.04  -7.70±0.05
BWGP      | 1.27±0.12  276.8±26.8    1.18±0.12    | 0.74±0.36  3.68±0.17  -7.89±0.07
WGP3      | 1.40±0.31  434.6±169.0   1.83±2.18    | —          —          —
WGP20     | 1.38±0.22  382.1±93.4    1.39±0.78    | —          —          —

4.4 Classification data sets

Classification can be regarded as an extreme case of censoring or quantisation of a regression data set. We also mentioned in Section 2.2 that the (conditional) generative model of GP classification could be seen as a particular selection for g(f). So we decided to test the BWGP model on the 13 classification data sets from the Rätsch benchmark [18].

[Figure 3 here, with panels (a) abalone (reg), (b) creep (reg), (c) ailerons (reg), (d) german (class), (e) abalone (cens), (f) creep (cens), (g) ailerons (cens), (h) titanic (class).]

Figure 3: Inferred warping functions.

Table 3: Error rates (in percentage) for the proposed model on the benchmark from Rätsch [18].

Model | ban   bre   dia   fla   ger   hea   ima   rin   spl   thy   tit   two   wav
GP    | 13.2  29.6  28.0  39.1  27.6  28.6  03.2  21.1  23.4  13.7  23.6  10.1  15.5
BWGP  | 10.7  29.5  24.5  33.3  23.9  23.5  02.1  04.8  17.0  04.7  22.0  04.2  12.4
GPC   | 10.6  29.5  24.2  33.5  24.8  21.7  02.1  07.9  22.8  04.0  22.2  04.2  11.4

Since WGP does not produce any meaningful results on this type of data, as mentioned in [1], we did not include it in the comparison. Instead, we used a standard GP classifier (GPC) with a probit likelihood and expectation propagation for approximate inference. We measured the error rate, which is the performance figure we are interested in for these data sets, averaging over 10 splits of the data.
Results from Table 3 show that BWGP is able to match and occasionally exceed the performance of GPC, outperforming the standard GP in all cases. The learned warping functions look similar across the different data sets. We have depicted two typical cases in Figs. 3.(d) and 3.(h). Especially good results are obtained for german, ringnorm, and splice, though we are aware that even better results can be obtained by using an isotropic SE covariance on these data sets [19].

5 Discussion and further work

In this work we have shown how it is possible to variationally integrate out the warping function from warped GPs. This is useful to overcome the limitations of maximum likelihood warped GPs, namely: to work in the low data sample regime; to handle censored observations and classification data; to explicitly model output noise; and to allow for warping functions of unlimited flexibility, which may include flat regions. The experiments demonstrate the improved robustness of the BWGP model, which is able to operate properly in a much wider set of scenarios. While a specific model (should it exist) will generally be a better tool for a specific task (e.g., GPC for classification), BWGP behaves as a Swiss Army knife, providing good performance on general tasks.

In addition to the tasks discussed in this work, there are other cases in which BWGP can be of immediate application. One example is ordinal regression [8], where the locations and widths of the bins can be integrated out instead of selected. Another potential future application is within the popular field of copulas [20, 21, 22, 23], since they routinely resort to fixed warpings of GPs.

Acknowledgments

MLG is grateful to Michalis K. Titsias and the anonymous reviewers for helpful comments.

References

[1] E. Snelson, Z. Ghahramani, and C. Rasmussen.
Warped Gaussian processes. In Advances in Neural Information Processing Systems 16, 2003.

[2] C.E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-linear Regression. PhD thesis, University of Toronto, 1996.

[3] M.N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997.

[4] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

[5] M.N. Schmidt. Function factorization using warped Gaussian processes. In Proc. of the 26th International Conference on Machine Learning, pages 921–928. Omnipress, 2009.

[6] Y. Zhang and D.-Y. Yeung. Multi-task warped Gaussian process for personalized age estimation. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 2622–2629, 2010.

[7] C.K.I. Williams and C.E. Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems 8. MIT Press, 1996.

[8] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019–1041, 2005.

[9] M.K. Titsias and N.D. Lawrence. Bayesian Gaussian process latent variable model. In Proc. of the 13th International Workshop on Artificial Intelligence and Statistics, volume 9 of JMLR: W&CP, pages 844–851, 2010.

[10] M.K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proc. of the 12th International Workshop on Artificial Intelligence and Statistics, 2009.

[11] M. Lázaro-Gredilla and M. Titsias. Variational heteroscedastic Gaussian process regression. In 28th International Conference on Machine Learning (ICML-11), pages 841–848, New York, NY, USA, June 2011. ACM.

[12] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

[13] A.C. Damianou, M.K. Titsias, and N.D. Lawrence. Variational Gaussian process dynamical systems. In Advances in Neural Information Processing Systems 24, 2011.

[14] A. Frank and A. Asuncion. UCI machine learning repository, 2010. http://archive.ics.uci.edu/ml University of California, Irvine, School of Information and Computer Sciences.

[15] Materials algorithms project (MAP) program and data library. http://www.msm.cam.ac.uk/map/map.html.

[16] D. Cole, C. Martin-Moran, A. G. Sheard, H. K. D. H. Bhadeshia, and D. J. C. MacKay. Modelling creep rupture strength of ferritic steel welds. Science and Technology of Welding and Joining, 5:81–90, 2000.

[17] L. Torgo. http://www.liacc.up.pt/~ltorgo/Regression/.

[18] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001. http://people.tuebingen.mpg.de/vipin/www.fml.tuebingen.mpg.de/Members/raetsch/benchmark.1.html.

[19] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, 2008.

[20] R.B. Nelsen. An Introduction to Copulas. Springer, 1999.

[21] P.X.-K. Song. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2):305–320, 2000.

[22] A. Wilson and Z. Ghahramani. Copula processes. In Advances in Neural Information Processing Systems 23, pages 2460–2468. MIT Press, 2010.

[23] F.L. Wauthier and M.I. Jordan. Heavy-tailed process priors for selective shrinkage. In Advances in Neural Information Processing Systems 23. MIT Press, 2010.