{"title": "Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 6005, "page_last": 6015, "abstract": "This paper studies the problem of deriving fast and accurate classification algorithms with uncertainty quantification. Gaussian process classification provides a principled approach, but the corresponding computational burden is hardly sustainable in large-scale problems and devising efficient alternatives is a challenge. In this work, we investigate if and how Gaussian process regression directly applied to classification labels can be used to tackle this question. While in this case training is remarkably faster, predictions need to be calibrated for classification and uncertainty estimation. To this aim, we propose a novel regression approach where the labels are obtained through the interpretation of classification labels as the coefficients of a degenerate Dirichlet distribution. Extensive experimental results show that the proposed approach provides essentially the same accuracy and uncertainty quantification as Gaussian process classification while requiring only a fraction of computational resources.", "full_text": "Dirichlet-based Gaussian Processes\n\nfor Large-scale Calibrated Classi\ufb01cation\n\nDimitrios Milios\n\nEURECOM\n\nSophia Antipolis, France\n\ndimitrios.milios@eurecom.fr\n\nPietro Michiardi\n\nEURECOM\n\nSophia Antipolis, France\n\npietro.michiardi@eurecom.fr\n\nRaffaello Camoriano\n\nLCSL\n\nIIT (Italy) & MIT (USA)\n\nraffaello.camoriano@iit.it\n\nLorenzo Rosasco\n\nDIBRIS - Universit\u00e0 degli Studi di Genova, Italy\n\nLCSL - IIT (Italy) & MIT (USA)\n\nlrosasco@mit.edu\n\nMaurizio Filippone\n\nEURECOM\n\nSophia Antipolis, France\n\nmaurizio.filippone@eurecom.fr\n\nAbstract\n\nThis paper studies the problem of deriving fast and accurate classi\ufb01cation algo-\nrithms with uncertainty quanti\ufb01cation. 
Gaussian process classi\ufb01cation provides a\nprincipled approach, but the corresponding computational burden is hardly sustain-\nable in large-scale problems and devising ef\ufb01cient alternatives is a challenge. In\nthis work, we investigate if and how Gaussian process regression directly applied\nto classi\ufb01cation labels can be used to tackle this question. While in this case\ntraining is remarkably faster, predictions need to be calibrated for classi\ufb01cation\nand uncertainty estimation. To this aim, we propose a novel regression approach\nwhere the labels are obtained through the interpretation of classi\ufb01cation labels\nas the coef\ufb01cients of a degenerate Dirichlet distribution. Extensive experimental\nresults show that the proposed approach provides essentially the same accuracy\nand uncertainty quanti\ufb01cation as Gaussian process classi\ufb01cation while requiring\nonly a fraction of computational resources.\n\n1\n\nIntroduction\n\nClassi\ufb01cation is a classic machine learning task. While the most basic performance measure is\nclassi\ufb01cation accuracy, in practice assigning a calibrated con\ufb01dence to the predictions is often crucial\n[5]. For example in image classi\ufb01cation, providing class predictions with a calibrated score is\nimportant to avoid making over-con\ufb01dent decisions [6, 12, 15]. Several classi\ufb01cation algorithms that\noutput a continuous score are not necessarily calibrated (e.g., support vector machines (SVMs) [24]).\nPopular ways to calibrate classi\ufb01ers use a validation set to learn a transformation of their output score\nthat recovers calibration; these include Platt scaling [24] and isotonic regression [39]. 
Calibration can\nalso be achieved if a sensible loss function is employed [13], for example the logistic/cross-entropy\nloss, and it is known to be positively impacted if the classi\ufb01er is well regularized [6].\nBayesian approaches provide a natural framework to tackle these kinds of questions, since quanti\ufb01-\ncation of uncertainty is of primary interest. In particular, Gaussian Processes Classi\ufb01cation (GPC)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f[8, 25, 36] combines the \ufb02exibility of Gaussian Processes (GPs) [25] and the regularization stem-\nming from their probabilistic nature, with the use of the correct likelihood for classi\ufb01cation, that is\nBernoulli or multinomial for binary or multi-class classi\ufb01cation, respectively. While we are not aware\nof empirical studies on the calibration properties of GPC, our results con\ufb01rm the intuition that GPC is\nactually calibrated. The most severe drawback of GPC, however, is its computational burden, making\nit unattractive for large-scale problems.\nIn this paper, we study the question of whether GPs can be made ef\ufb01cient to \ufb01nd accurate and\nwell-calibrated classi\ufb01cation rules. A simple idea is to use GP regression directly on classi\ufb01cation\nlabels. This idea is quite common in non-probabilistic approaches [27, 34] and can be grounded from\na decision theoretic point of view. Indeed, the Bayes\u2019 rule minimizing the expected least-squares\nis the expected conditional probability, which in classi\ufb01cation is directly related to the conditional\nprobabilities of each class (see e.g. [3, 31]). Performing regression directly on the labels leads to fast\ntraining and excellent classi\ufb01cation accuracies [11, 17, 29]. 
However, the corresponding predictions\nare not calibrated for uncertainty quanti\ufb01cation.\nThe main contribution of our work is the proposal of a transformation of the classi\ufb01cation labels,\nwhich turns the original problem into a regression problem without compromising on calibration. For\nGPs, this has the enormous advantage of bypassing the need for expensive posterior approximations,\nleading to a method that is as fast as a simple regression carried out on the original labels. The\nproposed method is based on the interpretation of the labels as the output of a Dirichlet distribution,\nso we name it Dirichlet-based GP classi\ufb01cation (GPD). Through an extensive experimental validation,\nincluding large-scale classi\ufb01cation tasks, we demonstrate that GPD is calibrated and competitive in\nperformance with state-of-the-art GPC1 .\n\n2 Related work\n\nCalibration of classi\ufb01ers: Platt scaling [24] is a popular method to calibrate the output score of\nclassi\ufb01ers, as well as isotonic regression [39]. More recently, Beta calibration [13] and temperature\nscaling [6] have been proposed to extend the class of possible transformations and reduce the\nparameterization of the transformation, respectively. It is established that binary classi\ufb01ers are\ncalibrated when they employ the logistic loss; this is a direct consequence of the fact that the\nappropriate model for Bernoulli distributed variables is the one associated with this loss [13]. The\nextension to multi-class problems yields the so-called cross-entropy loss, which corresponds to the\nmultinomial likelihood. 
The right loss does not necessarily make classifiers well calibrated, however; recent works on calibration of convolutional neural networks for image classification show that depth negatively impacts calibration due to the introduction of a large number of parameters to optimize, and that regularization is important to recover calibration [6].

Kernel-based classification: Performing regression on classification labels is also known as least-squares classification [27, 34]. We are not aware of works that study GP-based least-squares classification in depth; we could only find a few comments on it in [25] (Sec. 6.5). GPC is usually approached by assuming a latent process with a GP prior, which is transformed into a probability of class labels through a suitable squashing function [25]. Due to the non-conjugacy between the GP prior and the non-Gaussian likelihood, applying standard Bayesian inference techniques in GPC leads to analytical intractabilities, and it is necessary to resort to approximations. Standard ways to approximate computations include the Laplace Approximation [36] and Expectation Propagation (EP, [19]); see, e.g., [16, 21] for a detailed review of these methods. More recently, there have been advances extending "sparse" GP approximations [35] to classification [9], in order to deal with scalability in the number of observations through mini-batch-based optimization.

3 Background
Consider a multi-class classification problem. Given a set of N training inputs X = {x1, . . . , xN} and their corresponding labels Y = {y1, . . .
, yN}, with one-hot encoded classes denoted by the vectors yi, a classifier produces a predicted label f(x∗) as a function of any new input x∗.

1 The code is available at https://github.com/dmilios/dirichletGPC.

In the literature, calibration is assessed through the Expected Calibration Error (ECE) [6], which is the average of the absolute difference between accuracy and confidence:

ECE = Σ_{m=1}^{M} (|Xm| / |X∗|) |acc(f(Xm), Ym) − conf(f, Xm)| ,   (1)

where the test set X∗ is divided into disjoint subsets {X1, . . . , XM}, each corresponding to a given level of confidence conf(f, Xm) predicted by the classifier f, while acc(f(Xm), Ym) is the classification accuracy of f measured on the m-th subset. Other metrics used in this work to characterize the quality of a classifier are the error rate on the test set, and the mean negative log-likelihood (MNLL) of the test set under the classification model:

MNLL = −(1/|X∗|) Σ_{(x∗, y∗) ∈ (X∗, Y∗)} log p(y∗ | X, Y, x∗)   (2)

All metrics are defined so that lower values are better.

3.1 Kernel methods for classification

GP classification (GPC)  GP-based classification is defined by the following abstract steps:

1. A GP prior, which is characterized by mean function µ(x) and covariance function k(x, x′), is placed over a latent function f(x). The GP prior is transformed by a sigmoid function so that the sample functions produce proper probability values. In the multi-class case, we consider C independent priors over the vector of functions f = [f1, . . . , fC]⊤; transformation to proper probabilities is achieved by applying the softmax function σ(f).

2. The observed labels y are associated with a categorical likelihood with probability components p(yc | f) = σ(f(x))c, for any c ∈ {1, . . . , C}.

3.
The latent posterior is obtained by means of Bayes' theorem.
4. The latent posterior is transformed via σ(f), to obtain a distribution over class probabilities.

Throughout this work, we consider µ(x) = 0 and covariance k(x, x′) = a² exp(−(x − x′)² / (2l²)), which is also known as the RBF kernel, and is characterized by the hyper-parameters a² and l, interpreted as the GP marginal variance and length-scale, respectively. The hyper-parameters are commonly selected by maximizing the marginal likelihood of the model.
The major computational challenge of GPC can be identified in Step 3 described above. The categorical likelihood implies that the posterior over the stochastic process is not Gaussian and it cannot be calculated analytically. Therefore, different approaches resort to different approximations of the posterior, for which we have p(f | X, Y) ∝ p(f | X) p(y | f). For example in EP [19], local likelihoods are approximated by Gaussian terms so that the posterior has the following form:

p(f | X, Y) ≈ q(f | X, Y) ∝ p(f | X) N(µ̃, Σ̃)   (3)

where µ̃ and Σ̃ are determined by the site parameters learned through an iterative process. In variational classification approaches [9, 23], the approximating distribution q(f) is directly parametrized by a set of variational parameters. Despite being successful, such approaches contribute significantly to the computational cost of GP classification, as they introduce a large number of parameters that need to be optimized. In this work, we explore a more straightforward Gaussian approximation to the
In this work, we explore a more straightforward Gaussian approximation to the\nlikelihood that requires no signi\ufb01cant computational overhead.\n\n(cid:16)\u2212 (x\u2212x(cid:48))2\n\n(cid:17)\n\nGP regression (GPR) on classi\ufb01cation labels A simple way to bypass the problem induced by\ncategorical likelihoods is to perform least-squares regression on the labels by ignoring their discrete\nnature. This implies considering a Gaussian likelihood p(y | f ) = N (f , \u03c32\nn is the\nobservation noise variance. It is well-known that if the observed labels are 0 and 1, then the function\nf that minimizes the mean squared error converges to the true class probabilities in the limit of\nin\ufb01nite data [26]. Nevertheless, by not squashing f through a softmax function, we can no longer\nguarantee that the resulting distribution of functions will lie within 0 and 1. For this reason, additional\ncalibration steps are required (i.e. Platt scaling).\n\nnI), where \u03c32\n\n2Softmax function \u03c3(f ) s.t. \u03c3(f )j = exp(fj)/(cid:80)C\n\nc=1 exp(fc) for j = 1, ...C\n\n3\n\n\fFigure 1: Convergence of classi\ufb01ers with different loss functions and regularization properties. Left:\nsummary of the mean squared error (MSE) from the true function fp for 1000 randomly sampled\ntraining sets of different size; the Bayesian CE-based classi\ufb01er is characterized by smaller variance\neven when the number of training inputs is small. Right: demonstration of how the averaged classi\ufb01ers\napproximate the true function for different training sizes.\n\nKernel Ridge Regression (KRR) for classi\ufb01cation The idea of performing regression directly on\nthe labels is quite common when GP estimators are applied within a frequentist context [27]. Here\nthey are typically derived from a non-probabilistic perspective based on empirical risk minimization,\nand the corresponding approach is dubbed Kernel Ridge Regression [7]. Taking this perspective,\nwe make two observations. 
The first is that the noise and covariance parameters are viewed as regularization parameters that need to be tuned, typically by cross-validation. In our experiments, we compare this method with a canonical GPR approach. The second observation is that carrying out regression on the labels with least-squares can be justified from a decision-theoretic point of view. The Bayes' rule minimizing the expected least-squares is the regression function (the expected conditional probability), which in binary classification is proportional to the conditional probability of one of the two classes [3] (similar reasoning applies to multi-class classification [2, 20]). From this perspective, one could expect a least-squares estimator to be self-calibrated; however, this is typically not the case in practice, a shortcoming attributed to the limited number of points and the choice of function models. Post-hoc calibration has to be applied to both GPR- and KRR-based learning pipelines.

Platt scaling  Platt scaling [24] is an effective approach to perform post-hoc calibration for different types of classifiers, such as SVMs [22] and neural networks [6]. Given a decision function f, which is the result of a trained binary classifier, the class probabilities are given by the sigmoid transformation π(x) = σ(af(x) + b), where a and b are optimised over a separate validation set, so that the resulting model best explains the data. Although this parametric form may seem restrictive, Platt scaling has been shown to be effective for a wide range of classifiers [22].

3.2 A note on calibration properties

We advocate that two components are critical for well-calibrated classifiers: regularization and the cross-entropy loss. Previous work indicates that regularization has a positive effect on calibration [6]. Also, classifiers that rely on the cross-entropy loss are reported to be well-calibrated [22].
This form\nof loss function is equivalent to the negative Bernoulli log-likelihood (or categorical in the multi-class\ncase), which is the proper interpretation of classi\ufb01cation outcomes.\nIn Figure 1, we demonstrate the effects of regularization and cross-entropy empirically: we summarize\nclassi\ufb01cation results on four synthetic datasets of increasing size. We assume that each class label is\nsampled from a Bernoulli distribution with probability given by the unknown function fp : R \u2192 [0, 1].\nFor a classi\ufb01er to be well-calibrated, it is suf\ufb01cient that it accurately approximates fp. We \ufb01t three\nkinds of classi\ufb01ers: a maximum likelihood (ML) classi\ufb01er that relies on cross entropy loss (CE), a\nBayesian classi\ufb01er with MSE loss (i.e. GPR classi\ufb01cation), and \ufb01nally a Bayesian classi\ufb01er that relies\non CE (i.e. GPC). We report the averages over 1000 iterations and the average standard deviations.\nThe Bayesian classi\ufb01ers that rely on the cross entropy loss converge to the true solution at a faster\nrate, and they are characterized by smaller variance.\nAlthough performing GPR on the labels induces regularization through the prior, the likelihood model\nis not appropriate. One possible solution is to employ meticulous likelihood approximations such as\nEP or variational GP classi\ufb01cation [9], alas at an often prohibitive computational cost, especially for\nconsiderably large datasets. In the section that follows, we introduce a methodology that combines\nthe best of both worlds. 
We propose to perform GP regression on labels transformed in such a way that a less crude approximation of the categorical likelihood is achieved.

4 GP regression on transformed Dirichlet variables

There is an obvious defect in GP-based least-squares classification: each point is associated with a Gaussian likelihood, which is not the appropriate noise model for Bernoulli-distributed variables. Instead of approximating the true non-Gaussian likelihood, we propose to transform the labels into a latent space where a Gaussian approximation to the likelihood is more sensible.
For a given input, the goal of a Bayesian classifier is to estimate the distribution over its class probability vector; such a distribution is naturally represented by a Dirichlet-distributed random variable. More formally, in a C-class classification problem each observation y is a sample from a categorical distribution Cat(π). The objective is to infer the class probabilities π = [π1, . . . , πC]⊤, for which we use a Dirichlet model: π ∼ Dir(α). In order to fully describe the distribution of class probabilities, we have to estimate the concentration parameters α = [α1, . . . , αC]⊤. Given an observation y such that yk = 1, our best guess for the values of α will be: αk = 1 + αε and αi = αε, ∀i ≠ k. Note that it is necessary to add a small quantity 0 < αε ≪ 1, so as to have valid parameters for the Dirichlet distribution.
Intuitively, we implicitly induce a Dirichlet prior so that before observing a data point we have the probability mass shared equally across C classes; we know that we should observe exactly one count for a particular class, but we do not know which one. Most of the mass is concentrated on the corresponding class when y is observed. This practice can be thought of as the categorical/Bernoulli analogue of the noisy observations in GP regression. The likelihood model is:

p(y | α) = Cat(π), where π ∼ Dir(α).   (4)

It is well-known that a Dirichlet sample can be generated by sampling from C independent Gamma-distributed random variables with shape parameters αi and rate λ = 1; realizations of the class probabilities can be generated as follows:

πi = xi / Σ_{c=1}^{C} xc , where xi ∼ Gamma(αi, 1)   (5)

Therefore, the noisy Dirichlet likelihood assumed for each observation translates to C independent Gamma likelihoods with shape parameters either αi = 1 + αε, if yi = 1, or αi = αε otherwise.
In order to construct a Gaussian likelihood in the log-space, we approximate each Gamma-distributed xi with x̃i ∼ Lognormal(ỹi, σ̃i²) through moment matching (mean and variance):

E[xi] = E[x̃i] ⇔ αi = exp(ỹi + σ̃i²/2)
Var[xi] = Var[x̃i] ⇔ αi = (exp(σ̃i²) − 1) exp(2ỹi + σ̃i²)

Thus, for the parameters of the normally distributed logarithm we have:

ỹi = log αi − σ̃i²/2,   σ̃i² = log(1/αi + 1)   (6)

Note that this is the first approximation to the likelihood that we have employed so far. One could argue that a log-Normal approximation to a Gamma-distributed variable is reasonable, although it is not accurate for small values of the shape parameter αi.
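Concretely, the label transformation of Equation (6) reduces to a few lines of array code. The following is a minimal NumPy sketch; the function name and the `a_eps` argument are our own, purely illustrative choices:

```python
import numpy as np

def dirichlet_transform(Y, a_eps=0.01):
    """Map one-hot labels Y (N x C) to heteroskedastic regression targets.

    Moment matching of Eq. (6): each Gamma(alpha_i, 1) pseudo-observation
    is approximated by a log-Normal with the same mean and variance.
    """
    alpha = Y + a_eps                       # alpha_i = 1 + a_eps if y_i = 1, else a_eps
    sigma2 = np.log(1.0 / alpha + 1.0)      # per-label noise variances (Eq. 6)
    y_tilde = np.log(alpha) - sigma2 / 2.0  # transformed regression targets (Eq. 6)
    return y_tilde, sigma2
```

By construction, exp(ỹi + σ̃i²/2) recovers αi exactly, i.e. the first two moments of the Gamma pseudo-observations are preserved.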
However, the most important implication is that we can now consider a Gaussian likelihood in the log-space. Assuming a vector of latent processes f = [f1, . . . , fC]⊤, we have:

p(ỹi | f) = N(fi, σ̃i²),   (7)

where class labels in the transformed logarithmic space are now denoted by ỹi. We note that each observation is associated with a different noise parameter σ̃i², yielding a heteroskedastic regression model. In fact, the σ̃i² values (as well as ỹi) solely depend on the Dirichlet pseudo-count assumed in the prior, which has only two possible values. Given this likelihood approximation, it is straightforward to place a GP prior over f and evaluate the posterior over the C latent processes exactly. The multivariate GP prior does not assume any prior covariance across classes, meaning that they are assumed to be independent a priori. It is possible to make kernel parameters independent across processes, or shared so that they are informed by all classes.
Remark: In the binary classification case, we still have to perform regression on two latent processes. The use of a heteroskedastic noise model implies that one latent process is not a mirrored version of the other (see Figure 2), contrary to GPC.

Figure 2: Example of Dirichlet regression for a one-dimensional binary classification problem. Left: the latent GP posterior for class "0" (top) and class "1" (bottom). Right: the transformed posterior through softmax for class "0" (top) and class "1" (bottom).

4.1 From GP posterior to Dirichlet variables

The obtained GP posterior emulates the logarithm of a stochastic process with Gamma marginals that gives rise to the Dirichlet posterior over class labels.
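Given the Gaussian posterior marginals of the C latent processes at a set of test points, the Monte Carlo averaging of Equation (8) is a few lines of code. A minimal NumPy sketch, with function and variable names of our own choosing:

```python
import numpy as np

def predict_class_probs(mu, var, n_samples=1000, seed=None):
    """Monte Carlo estimate of the expected class probabilities.

    mu, var: (N, C) arrays with the posterior mean and variance of the
    C latent GPs at N test points.  Gaussian samples are pushed through
    the softmax and averaged, approximating the integral of Eq. (8).
    """
    rng = np.random.default_rng(seed)
    f = mu[None] + np.sqrt(var)[None] * rng.standard_normal((n_samples,) + mu.shape)
    f = f - f.max(axis=-1, keepdims=True)   # numerically stable softmax
    pi = np.exp(f)
    pi = pi / pi.sum(axis=-1, keepdims=True)
    return pi.mean(axis=0)                  # (N, C) expected class probabilities
```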
It is straightforward to sample from the posterior log-Normal marginals, which behave approximately as Gamma-distributed samples; posterior Dirichlet samples can then be generated as in Equation (5), which amounts to a simple application of the softmax function to the samples from the GP posterior. The expectation of class probabilities is:

E[πi,∗ | X, Y, x∗] = ∫ (exp(fi,∗) / Σj exp(fj,∗)) p(f∗ | X, Y, x∗) df∗ ,   (8)

which can be approximated by sampling from the Gaussian posterior p(f∗ | X, Y, x∗).
Figure 2 shows an example of Dirichlet regression for a one-dimensional binary classification problem. The left panels demonstrate how the GP posterior approximates the transformed data; the error bars represent the standard deviation for each data-point. Notice that the posterior for class "0" (top) is not a mirror image of class "1" (bottom), because of the different noise terms assumed for each latent process. The right panels show results in the original output space, after applying the softmax transformation; as expected in the binary case, one posterior process is a mirror image of the other.

4.2 Optimizing the Dirichlet prior αε

The performance of Dirichlet-based classification is affected by the choice of αε, in addition to the usual GP hyper-parameters. As αε approaches zero, αi converges to either 1 or 0. It is easy to see that for the transformed "1" labels we have σ̃i² = log(2) and ỹi = log(1/√2) in the limit. The transformed "0" labels, however, diverge (ỹi → −∞), and so do their variances. The role of αε is to make the transformed labels finite, so that it is possible to perform regression.
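This limiting behaviour is easy to verify numerically from the transformation formulas of Equation (6); a small self-contained check of our own:

```python
import numpy as np

# Behaviour of the transformed labels as the pseudo-count a_eps shrinks (Eq. 6).
for a_eps in [0.1, 0.01, 0.001]:
    a1, a0 = 1.0 + a_eps, a_eps              # alpha for a "1" and for a "0" label
    s2_1 = np.log(1.0 / a1 + 1.0)            # -> log(2) as a_eps -> 0
    y1 = np.log(a1) - s2_1 / 2.0             # -> log(1/sqrt(2)) as a_eps -> 0
    s2_0 = np.log(1.0 / a0 + 1.0)            # grows without bound
    y0 = np.log(a0) - s2_0 / 2.0             # diverges to -infinity
    print(a_eps, y1, s2_1, y0, s2_0)
```

As αε decreases, the "1" targets settle near log(1/√2) with variance log(2), while the "0" targets drift towards −∞ with ever larger variance.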
The smaller αε is, the further apart the transformed labels will be, but at the same time, the variance for the "0" label will be larger.
By increasing αε, the transformed labels of different classes tend to be closer. The marginal log-likelihood tends to be larger, as it is easier for a zero-mean GP prior to fit the data. However, this behavior is not desirable for classification purposes. For this reason, the Gaussian marginal log-likelihood in the transformed space is not appropriate to determine the optimal value for αε. Figure 3 demonstrates the effect of αε on classification quality, as reflected by the MNLL metric. Each subfigure corresponds to a different dataset; MNLL is reported for different choices of αε between 0.1 and 0.001. As a general remark, it appears that there is no globally optimal αε parameter across datasets. However, the reported training and test MNLL curves appear to be in agreement regarding the optimal choice for αε. We therefore propose to select the αε value that minimizes the MNLL on the training data.

5 Experiments

We experimentally evaluate the methodologies discussed on the datasets outlined in Table 1.
For the implementation of GP-based models, we use and extend the algorithms available in the GPflow library [18]. More specifically, for GPC we make use of variational sparse GP [8], while for GPR we employ sparse variational GP regression [35]. The latter is also the basis for our GPD implementation: we apply adjustments so that heteroskedastic noise is admitted, as dictated by the Dirichlet mapping. Concerning KRR, in order to scale it up to large-scale problems we use a subsampling-based variant named Nyström KRR (NKRR) [33, 37]. Nyström-based approaches have been shown to achieve state-of-the-art accuracy on large-scale learning problems [4, 14, 28, 30, 32]. The number of inducing (subsampled) points used for each dataset is reported in Table 1.

Figure 3: Exploration of αε for 4 different datasets ((a) HTRU2, (b) MAGIC, (c) LETTER, (d) DRIVE) with respect to the MNLL metric.

Table 1: Datasets used for evaluation, available from the UCI repository [1].

Dataset    Classes  Training instances  Test instances  Dimensionality  Inducing points
EEG        2        10980               4000            14              200
HTRU2      2        12898               5000            8               200
MAGIC      2        14020               5000            10              200
MINIBOO    2        120064              10000           50              400
COVERBIN   2        522910              58102           54              500
SUSY       2        4000000             1000000         18              200
LETTER     26       15000               5000            16              200
DRIVE      11       48509               10000           48              500
MOCAP      5        68095               10000           37              500

The experiments have been repeated for 10 random training/test splits. For each iteration, inducing points are chosen by applying k-means clustering on the training inputs. Exceptions are COVERBIN and SUSY, for which we used 5 splits and inducing points chosen uniformly at random. For GPR we further split each training dataset: 80% of which is used to train the model and the remaining 20% is used for calibration with Platt scaling.
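For reference, Platt scaling fits just the two sigmoid parameters a and b on held-out scores by minimizing the Bernoulli negative log-likelihood. A minimal NumPy sketch using plain gradient descent (our own illustrative implementation, not the code used in these experiments):

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, n_iter=5000):
    """Fit pi(x) = sigmoid(a * f(x) + b) on a held-out validation set.

    scores: real-valued decision values f(x); labels: 0/1 targets.
    Gradient descent on the mean Bernoulli negative log-likelihood,
    whose gradient w.r.t. z = a*f(x) + b is sigmoid(z) - y.
    """
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        z = a * scores + b
        grad_z = 1.0 / (1.0 + np.exp(-z)) - labels  # sigmoid(z) - y
        a -= lr * np.mean(grad_z * scores)
        b -= lr * np.mean(grad_z)
    return a, b
```

The calibrated probability for a new score f(x) is then 1 / (1 + exp(−(a·f(x) + b))).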
NKRR uses an 80-20% split for k-fold cross-validation and Platt scaling calibration, respectively. For each of the datasets, the αε parameter of GPD was selected according to the training MNLL: we have 0.1 for COVERBIN, 0.001 for LETTER, DRIVE and MOCAP, and 0.01 for the remaining datasets.
In all experiments, we consider an isotropic RBF kernel; the kernel hyper-parameters are selected by maximizing the marginal likelihood for the GP-based approaches, and by k-fold cross-validation for NKRR (with k = 10 for all datasets except for SUSY, for which k = 5). In the case of GPD, kernel parameters are shared across classes so that they are informed by all classes. In the case of GPR, we also optimize the noise variance jointly with all kernel parameters.
The performance of GPD, GPC, GPR and NKRR is compared in terms of various error metrics, including error rate, MNLL and ECE, for a collection of datasets. The obtained error rate, MNLL and ECE values are summarized in Figure 4. The GPC method tends to outperform GPR in most cases. Regarding the GPD approach, its performance tends to lie between GPC and GPR; in some instances its classification performance is better than GPC and NKRR. Most importantly, this performance is obtained at a fraction of the computational time required by the GPC method. Figure 5 summarizes the speed-up achieved during hyper-parameter optimization by GPD in comparison with the variational GP classification approach. In the context of sparse variational regression, these computational gains are a consequence of closed-form results for the optimal variational distribution [35], which are not available for non-Gaussian likelihoods. We note that hyper-parameter and variational optimization have been performed using the ScipyOptimizer class of GPflow, which applies early stopping if convergence is detected. Convergence for GPD is faster simply because optimization involves fewer parameters.
A more detailed exposition can be found in the supplementary material.

Figure 4: Error rate, MNLL and ECE for the datasets considered in this work.

Figure 5: Left: Speed-up obtained by using GPD as opposed to GPC. Right: Error vs training time for GPD as the number of inducing points is increased for three datasets ((a) LETTER, (b) MINIBOO, (c) MOCAP). The dashed line represents the error obtained by GPC using the same number of inducing points as the fastest GPD listed.

This dramatic difference in computational efficiency has some interesting implications regarding the applicability of GP-based classification methods on large datasets. GP-based machine learning approaches are known to be computationally expensive; their practical application on large datasets demands the use of scalable methods to perform approximate inference. The approximation quality of sparse approaches depends on the number (and the selection) of inducing points. In the case of classification, the speed-up obtained by GPD implies that the saved computational budget can be spent on a more fine-grained sparse GP approximation. In Figure 5, we explore the effect of increasing the number of inducing points Nu for three datasets: LETTER with Nu ∈ {500, 800, 1000, 1600}, MINIBOO with Nu ∈ {400, 500, 600, 800} and MOCAP with Nu ∈ {500, 800, 1000, 1600}. Regarding GPC, we fix the computational budget to the smallest Nu in each case.
We see that the error rate for GPD drops significantly as the budget is increased; even so, the training cost remains a fraction of the original GPC computational effort.

Finally, we acknowledge that the computational cost of variational GPC can be reduced by means of mini-batch training [8, 10, 38]. In the supplementary material, we perform a detailed comparison between GPD and variational GPC with mini-batches [8]. The efficiency of GPC with a carefully selected mini-batch size is significantly improved, although stochastic optimization is characterized by slower convergence compared to full-batch optimization. As a result, GPD convergence remains faster for most datasets. This advantage becomes more obvious in scenarios where hyper-parameters are either known or reused, since no optimization step is required for a regression-based method.

6 Conclusions

Most GP-based approaches to classification in the literature are characterized by a meticulous approximation of the likelihood. In this work, we experimentally show that such GP classifiers tend to be well-calibrated, meaning that they correctly estimate classification uncertainty, as this is expressed through class probabilities.
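The ECE metric reported in Figure 4 quantifies exactly this notion of calibration. As a concrete reference, a minimal binned-ECE computation is sketched below; the equal-width 10-bin scheme is a common default and an assumption here, not necessarily the exact protocol of the experiments:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: population-weighted average of |accuracy - confidence|
    over equal-width bins of the predicted-class confidence."""
    conf = probs.max(axis=1)                  # confidence of predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():                         # skip empty bins
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# A classifier that is 60% confident but always wrong is badly calibrated:
probs = np.array([[0.6, 0.4], [0.6, 0.4]])
print(expected_calibration_error(probs, np.array([1, 1])))  # 0.6
```

A perfectly calibrated classifier has ECE equal to zero, since in every confidence bin the empirical accuracy matches the average predicted confidence.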
Despite this desirable property, their applicability is limited to datasets of small to moderate size, due to the high computational complexity of approximating the true posterior distribution.

Least-squares classification, which may be implemented either as GPR or KRR, is an established practice for more scalable classification. However, the crude approximation of a non-Gaussian likelihood with a Gaussian one has a negative impact on classification quality, especially as reflected in the calibration properties of the classifier.

Considering the strengths and practical limitations of GPs, we proposed a classification approach that is essentially a heteroskedastic GP regression on a latent space induced by a transformation of the labels, which are viewed as Dirichlet-distributed random variables. This allowed us to convert C-class classification to a problem of regression involving C latent processes with Gamma likelihoods. We then proposed to approximate the Gamma-distributed variables with log-Normal ones, and thus we achieved a sensible Gaussian approximation in the logarithmic space. Crucially, this can be seen as a pre-processing step that does not have to be learned, unlike in GPC, where an accurate transformation is sought iteratively. Our experimental analysis shows that Dirichlet-based GP classification produces well-calibrated classifiers without the need for post-hoc calibration steps. The performance of our approach in terms of classification accuracy tends to lie between properly-approximated GPC and least-squares classification, but most importantly it is orders of magnitude faster than GPC.

As a final remark, we note that the predictive distribution of the GPD approach is different from that obtained by GPC, as can be seen in the extended results in the supplementary material. An extended
An extended\ncharacterization of the predictive distribution for GPD is subject of future work.\n\nAcknowledgments\n\nL. R. acknowledges the \ufb01nancial support of the Center for Brains, Minds and Machines (CBMM),\nfunded by NSF STC award CCF-1231216, the Italian Institute of Technology, the AFOSR projects\nFA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Of\ufb01ce of Aerospace Research\nand Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. R. C. and\nL. R. gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp\nGPUs and the Tesla k40 GPU used for this research. DM and PM are partially supported by KPMG.\nMF gratefully acknowledges support from the AXA Research Fund.\n\nReferences\n[1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.\n\n[2] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multi-output learning via spectral \ufb01ltering. Machine\n\nlearning, 87(3):259\u2013301, 2012.\n\n[3] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classi\ufb01cation, and risk bounds. Journal of the American\n\nStatistical Association, 2006.\n\n[4] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. NYTRO: When Subsampling Meets Early Stopping. In\nProceedings of the 19th International Conference on Arti\ufb01cial Intelligence and Statistics, pages 1403\u20131411,\n2016.\n\n[5] P. A. Flach. Classi\ufb01er Calibration, pages 1\u20138. Springer US, Boston, MA, 2016.\n\n[6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In D. Precup\nand Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70\nof Proceedings of Machine Learning Research, pages 1321\u20131330, International Convention Centre, Sydney,\nAustralia, Aug. 2017. PMLR.\n\n[7] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining,\n\ninference and prediction. 
The Mathematical Intelligencer, 27(2):83–85, 2001.

[8] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360, San Diego, California, USA, May 2015. PMLR.

[9] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for Variationally Sparse Gaussian Processes. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1648–1656. Curran Associates, Inc., 2015.

[10] D. Hernández-Lobato and J. M. Hernández-Lobato. Scalable Gaussian process classification via expectation propagation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 168–176, 2016.

[11] P. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.

[12] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, Mar. 2017.

[13] M. Kull, T. S. Filho, and P. Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 623–631, Fort Lauderdale, FL, USA, Apr. 2017. PMLR.

[14] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström Method. In NIPS, pages 1060–1068. Curran Associates, Inc., 2009.

[15] A. Kurakin, I.
Goodfellow, and S. Bengio. Adversarial examples in the physical world, Feb. 2017. arXiv:1607.02533.

[16] M. Kuss and C. E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005.

[17] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, and F. Sha. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. CoRR, abs/1411.4000, 2014.

[18] A. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, Apr. 2017.

[19] T. P. Minka. Expectation Propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[20] Y. Mroueh, T. Poggio, L. Rosasco, and J.-J. Slotine. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems, pages 2789–2797, 2012.

[21] H. Nickisch and C. E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035–2078, Oct. 2008.

[22] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 625–632. ACM, 2005.

[23] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Comput., 21(3):786–792, 2009.

[24] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.

[25] C. E. Rasmussen and C. Williams.
Gaussian Processes for Machine Learning. MIT Press, 2006.

[26] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[27] R. Rifkin, G. Yeo, T. Poggio, and others. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.

[28] A. Rudi, R. Camoriano, and L. Rosasco. Less is More: Nyström Computational Regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1657–1665. Curran Associates, Inc., 2015.

[29] A. Rudi, L. Carratino, and L. Rosasco. FALKON: An Optimal Large Scale Kernel Method. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3888–3898. Curran Associates, Inc., 2017.

[30] A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.

[31] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[32] S. Si, C.-J. Hsieh, and I. S. Dhillon. Memory Efficient Kernel Approximation. In ICML, volume 32 of JMLR Proceedings, pages 701–709. JMLR.org, 2014.

[33] A. J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In ICML, pages 911–918. Morgan Kaufmann, 2000.

[34] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Process. Lett., 9(3):293–300, June 1999.

[35] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In D. A. Dyk and M.
Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, volume 5 of JMLR Proceedings, pages 567–574. JMLR.org, 2009.

[36] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1342–1351, 1998.

[37] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS, pages 682–688. MIT Press, 2000.

[38] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Stochastic variational deep kernel learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 2594–2602. Curran Associates Inc., 2016.

[39] B. Zadrozny and C. Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 694–699, New York, NY, USA, 2002. ACM.