{"title": "Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data by Minimizing Predictive Variance", "book": "Advances in Neural Information Processing Systems", "page_first": 5322, "page_last": 5333, "abstract": "Large amounts of labeled data are typically required to train deep learning models. For many real-world problems, however, acquiring additional data can be expensive or even impossible. We present semi-supervised deep kernel learning (SSDKL), a semi-supervised regression model based on minimizing predictive variance in the posterior regularization framework. SSDKL combines the hierarchical representation learning of neural networks with the probabilistic modeling capabilities of Gaussian processes. By leveraging unlabeled data, we show improvements on a diverse set of real-world regression tasks over supervised deep kernel learning and semi-supervised methods such as VAT and mean teacher adapted for regression.", "full_text": "Semi-supervised Deep Kernel Learning:\n\nRegression with Unlabeled Data by Minimizing\n\nPredictive Variance\n\nNeal Jean\u2217, Sang Michael Xie\u2217, Stefano Ermon\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\n{nealjean, xie, ermon}@cs.stanford.edu\n\nAbstract\n\nLarge amounts of labeled data are typically required to train deep learning models.\nFor many real-world problems, however, acquiring additional data can be expensive\nor even impossible. We present semi-supervised deep kernel learning (SSDKL), a\nsemi-supervised regression model based on minimizing predictive variance in the\nposterior regularization framework. SSDKL combines the hierarchical represen-\ntation learning of neural networks with the probabilistic modeling capabilities of\nGaussian processes. 
By leveraging unlabeled data, we show improvements on a\ndiverse set of real-world regression tasks over supervised deep kernel learning and\nsemi-supervised methods such as VAT and mean teacher adapted for regression.\n\n1\n\nIntroduction\n\nThe prevailing trend in machine learning is to automatically discover good feature representations\nthrough end-to-end optimization of neural networks. However, most success stories have been enabled\nby vast quantities of labeled data [1]. This need for supervision poses a major challenge when we\nencounter critical scienti\ufb01c and societal problems where \ufb01ne-grained labels are dif\ufb01cult to obtain.\nAccurately measuring the outcomes that we care about\u2014e.g., childhood mortality, environmental\ndamage, or extreme poverty\u2014can be prohibitively expensive [2, 3, 4]. Although these problems\nhave limited data, they often contain underlying structure that can be used for learning; for example,\npoverty and other socioeconomic outcomes are strongly correlated over both space and time.\nSemi-supervised learning approaches offer promise when few labels are available by allowing models\nto supplement their training with unlabeled data [5]. Mostly focusing on classi\ufb01cation tasks, these\nmethods often rely on strong assumptions about the structure of the data (e.g., cluster assumptions,\nlow data density at decision boundaries [6]) that generally do not apply to regression [7, 8, 9, 10, 11].\nIn this paper, we present semi-supervised deep kernel learning, which addresses the challenge of semi-\nsupervised regression by building on previous work combining the feature learning capabilities of\ndeep neural networks with the ability of Gaussian processes to capture uncertainty [12, 3, 13]. 
SSDKL incorporates unlabeled training data by minimizing predictive variance in the posterior regularization framework, a flexible way of encoding prior knowledge in Bayesian models [14, 15, 16].\n\n\u2217denotes equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Depiction of the variance minimization approach behind semi-supervised deep kernel learning (SSDKL). The x-axis represents one dimension of a neural network embedding and the y-axis represents the corresponding output. Left: Without unlabeled data, the model learns an embedding by maximizing the likelihood of labeled data. The black and gray dotted lines show the posterior distribution after conditioning. Right: Embedding learned by SSDKL tries to minimize the predictive variance of unlabeled data, encouraging unlabeled embeddings to be near labeled embeddings. Observe that the representations of both labeled and unlabeled data are free to change.\n\nOur main contributions are the following:\n\n\u2022 We introduce semi-supervised deep kernel learning (SSDKL) for the largely unexplored domain of deep semi-supervised regression. SSDKL is a regression model that combines the strengths of heavily parameterized deep neural networks and nonparametric Gaussian processes. While the deep Gaussian process kernel induces structure in an embedding space, the model also allows a priori knowledge of structure (i.e., spatial or temporal) in the input features to be naturally incorporated through kernel composition.\n\u2022 By formalizing the semi-supervised variance minimization objective in the posterior regularization framework, we unify previous semi-supervised approaches such as minimum entropy and minimum variance regularization under a common framework. 
To our knowledge, this is the first paper connecting semi-supervised methods to posterior regularization.\n\u2022 We demonstrate that SSDKL can use unlabeled data to learn more generalizable features and improve performance on a range of regression tasks, outperforming the supervised deep kernel learning method and semi-supervised methods such as virtual adversarial training (VAT) and mean teacher [17, 18]. In a challenging real-world task of predicting poverty from satellite images, SSDKL outperforms the state-of-the-art by 15.5%; incorporating prior knowledge of spatial structure increases the median improvement to 17.9%.\n\n2 Preliminaries\n\nWe assume a training set of n labeled examples {(x_i, y_i)}_{i=1}^{n} and m unlabeled examples {x_j}_{j=n+1}^{n+m} with instances x \u2208 R^d and labels y \u2208 R. Let X_L, y_L, X_U refer to the aggregated features and targets, where X_L \u2208 R^{n\u00d7d}, y_L \u2208 R^n, and X_U \u2208 R^{m\u00d7d}. At test time, we are given examples X_T \u2208 R^{t\u00d7d} that we would like to predict.\nThere are two major paradigms in semi-supervised learning, inductive and transductive. In inductive semi-supervised learning, the labeled data (X_L, y_L) and unlabeled data X_U are used to learn a function f : X \u21a6 Y that generalizes well and is a good predictor on unseen test examples X_T [5]. In transductive semi-supervised learning, the unlabeled examples are exactly the test data that we would like to predict, i.e., X_T = X_U [19]. A transductive learning approach tries to find a function f : X^{n+m} \u21a6 Y^{n+m}, with no requirement of generalizing to additional test examples. Although the theoretical development of SSDKL is general to both the inductive and transductive regimes, we only test SSDKL in the inductive setting in our experiments for direct comparison against supervised learning methods.\n\nGaussian processes A Gaussian process (GP) is a collection of random variables, any finite number of which form a multivariate Gaussian distribution. Following the notation of [20], a Gaussian process defines a distribution over functions f : R^d \u2192 R from inputs to target values. If\n\nf(x) \u223c GP(\u00b5(x), k_\u03c6(x_i, x_j))\n\nwith mean function \u00b5(x) and covariance kernel function k_\u03c6(x_i, x_j) parameterized by \u03c6, then any collection of function values is jointly Gaussian,\n\nf(X) = [f(x_1), ..., f(x_n)]^T \u223c N(\u00b5, K_{X,X}),\n\nwith mean vector and covariance matrix defined by the GP, s.t. \u00b5_i = \u00b5(x_i) and (K_{X,X})_{ij} = k_\u03c6(x_i, x_j). In practice, we often assume that observations include i.i.d. Gaussian noise, i.e., y(x) = f(x) + \u03b5(x) where \u03b5 \u223c N(0, \u03c6_n^2), and the covariance function becomes\n\nCov(y(x_i), y(x_j)) = k(x_i, x_j) + \u03c6_n^2 \u03b4_{ij}\n\nwhere \u03b4_{ij} = I[i = j]. To make predictions at unlabeled points X_U, we can compute a Gaussian posterior distribution in closed form by conditioning on the observed data (X_L, y_L). For a more thorough introduction, we refer readers to [21].\n\nDeep kernel learning Deep kernel learning (DKL) combines neural networks with GPs by using a neural network embedding as input to a deep kernel [12]. Given input data x \u2208 X, a neural network parameterized by w is used to extract features h_w(x) \u2208 R^p. 
The outputs are modeled as\n\nf(x) \u223c GP(\u00b5(h_w(x)), k_\u03c6(h_w(x_i), h_w(x_j)))\n\nfor some mean function \u00b5(\u00b7) and base kernel function k_\u03c6(\u00b7,\u00b7) with parameters \u03c6. Parameters \u03b8 = (w, \u03c6) of the deep kernel are learned jointly by minimizing the negative log likelihood of the labeled data [20]:\n\nL_likelihood(\u03b8) = \u2212 log p(y_L | X_L, \u03b8)    (1)\n\nFor Gaussian distributions, the marginal likelihood is a closed-form, differentiable expression, allowing DKL models to be trained via backpropagation.\n\nPosterior regularization In probabilistic models, domain knowledge is generally imposed through the specification of priors. These priors, along with the observed data, determine the posterior distribution through the application of Bayes\u2019 rule. However, it can be difficult to encode our knowledge in a Bayesian prior. Posterior regularization offers a more direct and flexible mechanism for controlling the posterior distribution.\nLet D = (X_L, y_L) be a collection of observed data. [15] present a regularized optimization formulation called regularized Bayesian inference, or RegBayes. In this framework, the regularized posterior is the solution of the following optimization problem:\n\nRegBayes:    inf_{q(M|D) \u2208 P_prob} L(q(M|D)) + \u2126(q(M|D))    (2)\n\nwhere L(q(M|D)) is defined as the KL-divergence between the desired post-data posterior q(M|D) over models M and the standard Bayesian posterior p(M|D) and \u2126(q(M|D)) is a posterior regularizer. The goal is to learn a posterior distribution that is not too far from the standard Bayesian posterior while also fulfilling some requirements imposed by the regularization.\n\n3 Semi-supervised deep kernel learning\n\nWe introduce semi-supervised deep kernel learning (SSDKL) for problems where labeled data is limited but unlabeled data is plentiful. 
To learn from unlabeled data, we observe that a Bayesian approach provides us with a predictive posterior distribution; that is, we are able to quantify predictive uncertainty. Thus, we regularize the posterior by adding an unsupervised loss term that minimizes the predictive variance at unlabeled data points:\n\nL_semisup(\u03b8) = (1/n) L_likelihood(\u03b8) + (\u03b1/m) L_variance(\u03b8)    (3)\n\nL_variance(\u03b8) = \u2211_{x \u2208 X_U} Var(f(x))    (4)\n\nwhere n and m are the numbers of labeled and unlabeled training examples, \u03b1 is a hyperparameter controlling the trade-off between supervised and unsupervised components, and \u03b8 represents the model parameters.\n\n3.1 Variance minimization as posterior regularization\n\nOptimizing L_semisup is equivalent to computing a regularized posterior through solving a specific instance of the RegBayes optimization problem (2), where our choice of regularizer corresponds to variance minimization.\nLet X = (X_L, X_U) be the observed input data and D = (X, y_L) be the input data with labels for the labeled part X_L. Let F denote a space of functions where for f \u2208 F, f : R^d \u2192 R maps from the inputs to the target values. Note that here, M = (f, \u03b8) is the model in the RegBayes framework, where \u03b8 are the model parameters. We assume that the prior is \u03c0(f, \u03b8) and a likelihood density p(D|f, \u03b8) exists. Given observed data D, the Bayesian posterior is p(f, \u03b8|D), while RegBayes computes a different, regularized posterior.\nLet \u03b8\u0304 be a specific instance of the model parameters. 
Instead of maximizing the marginal likelihood of the labeled training data in a purely supervised approach, we train our model in a semi-supervised fashion by minimizing the compound objective\n\nL_semisup(\u03b8\u0304) = \u2212(1/n) log p(y_L | X_L, \u03b8\u0304) + (\u03b1/m) \u2211_{x \u2208 X_U} Var_{f\u223cp}(f(x))    (5)\n\nwhere the variance is with respect to p(f | \u03b8\u0304, D), the Bayesian posterior given \u03b8\u0304 and D.\n\nTheorem 1. Let observed data D, a suitable space of functions F, and parameter space \u0398 be given. As in [15], we assume that F is a complete separable metric space and \u03a0 is an absolutely continuous probability measure (with respect to background measure \u03b7) on (F, B(F)), where B(F) is the Borel \u03c3-algebra, such that a density \u03c0 exists where d\u03a0 = \u03c0 d\u03b7 and we have prior density \u03c0(f, \u03b8) and likelihood density p(D|f, \u03b8). Then the semi-supervised variance minimization problem (5)\n\ninf_{\u03b8\u0304} L_semisup(\u03b8\u0304)\n\nis equivalent to the RegBayes optimization problem (2)\n\ninf_{q(f,\u03b8|D) \u2208 P_prob} L(q(f, \u03b8|D)) + \u2126(q(f, \u03b8|D)),\n\n\u2126(q(f, \u03b8|D)) = \u03b1\u2032 \u2211_{i=1}^{m} ( \u222b_{f,\u03b8} p(f|\u03b8, D) q(\u03b8|D) (f(X_U)_i \u2212 E_p[f(X_U)_i])^2 d\u03b7(f, \u03b8) ),\n\nwhere \u03b1\u2032 = \u03b1n/m, and P_prob = {q : q(f, \u03b8|D) = q(f|\u03b8, D) \u03b4_{\u03b8\u0304}(\u03b8|D), \u03b8\u0304 \u2208 \u0398} is a variational family of distributions where q(\u03b8|D) is restricted to be a Dirac delta centered on \u03b8\u0304 \u2208 \u0398.\nWe include a formal derivation in Appendix A.1 and give a brief outline here. 
It can be shown that solving the variational optimization objective\n\ninf_{q(f,\u03b8|D)} D_KL(q(f, \u03b8|D) \u2016 \u03c0(f, \u03b8)) \u2212 \u222b_{f,\u03b8} q(f, \u03b8|D) log p(D|f, \u03b8) d\u03b7(f, \u03b8)    (6)\n\nis equivalent to minimizing the unconstrained form of the first term L(q(f, \u03b8|D)) of the RegBayes objective in Theorem 1, and the minimizer is precisely the Bayesian posterior p(f, \u03b8|D). When we restrict the optimization to q \u2208 P_prob the solution is of the form q\u2217(f, \u03b8|D) = p(f|\u03b8, D) \u03b4_{\u03b8\u0304}(\u03b8|D) for some \u03b8\u0304. This allows us to show that (6) is also equivalent to minimizing the first term of L_semisup(\u03b8\u0304). Finally, noting that the regularization function \u2126 only depends on \u03b8\u0304 (through q(\u03b8|D) = \u03b4_{\u03b8\u0304}(\u03b8)), the form of q\u2217(f, \u03b8|D) is unchanged after adding \u2126. Therefore the choice of \u2126 reduces to minimizing the predictive variance with respect to q\u2217(f|\u03b8, D) = p(f|\u03b8\u0304, D).\n\nIntuition for variance minimization By minimizing L_semisup, we trade off maximizing the likelihood of our observations with minimizing the posterior variance on unlabeled data that we wish to predict. The posterior variance acts as a proxy for distance with respect to the kernel function in the deep feature space, and the regularizer is an inductive bias on the structure of the feature space. Since the deep kernel parameters are jointly learned, the neural net is encouraged to learn a feature representation in which the unlabeled examples are closer to the labeled examples, thereby reducing the variance on our predictions. If we imagine the labeled data as \u201csupports\u201d for the surface representing the posterior mean, we are optimizing for embeddings where unlabeled data tend to cluster around these labeled supports. 
In contrast, the variance regularizer would not bene\ufb01t\nconventional GP learning since \ufb01xed kernels would not allow for adapting the relative distances\nbetween data points.\nAnother interpretation is that the semi-supervised objective is a regularizer that reduces over\ufb01tting\nto labeled data. The model is discouraged from learning features from labeled data that are not also\nuseful for making low-variance predictions at unlabeled data points. In settings where unlabeled data\nprovide additional variation beyond labeled examples, this can improve model generalization.\n\nTraining and inference Semi-supervised deep kernel learning scales well with large amounts\nof unlabeled data since the unsupervised objective Lvariance naturally decomposes into a sum\nover conditionally independent terms. This allows for mini-batch training on unlabeled data with\nstochastic gradient descent. Since all of the labeled examples are interdependent, computing exact\ngradients for labeled examples requires full batch gradient descent on the labeled data. Therefore,\nassuming a constant batch size, each iteration of training requires O(n3) computations for a Cholesky\ndecomposition, where n is the number of labeled training examples. Performing the GP inference\nrequires O(n3) one-time cost in the labeled points. However, existing approximation methods based\non kernel interpolation and structured matrices used in DKL can be directly incorporated in SSDKL\nand would reduce the training complexity to close to linear in labeled dataset size and inference to\nconstant time per test point [12, 22]. 
While DKL is designed for the supervised setting where scaling\nto large labeled datasets is a very practical concern, our focus is on semi-supervised settings where\nlabels are limited but unlabeled data is abundant.\n\n4 Experiments and results\n\nWe apply SSDKL to a variety of real-world regression tasks in the inductive semi-supervised learning\nsetting, beginning with eight datasets from the UCI repository [23]. We also explore the challenging\ntask of predicting local poverty measures from high-resolution satellite imagery [24]. In our reported\nresults, we use the squared exponential or radial basis function kernel. We also experimented with\npolynomial kernels, but saw generally worse performance. Our SSDKL model is implemented in\nTensorFlow [25]. Additional training details are provided in Appendix A.3, and code and data for\nreproducing experimental results can be found on GitHub.2\n\n4.1 Baselines\n\nWe \ufb01rst compare SSDKL to the purely supervised DKL, showing the contribution of unlabeled data.\nIn addition to the supervised DKL method, we compare against semi-supervised methods including\nco-training, consistency regularization, generative modeling, and label propagation. Many of these\nmethods were originally developed for semi-supervised classi\ufb01cation, so we adapt them here for\nregression. All models, including SSDKL, were trained from random initializations.\nCOREG, or CO-training REGressors, uses two k-nearest neighbor (kNN) regressors, each of which\ngenerates labels for the other during the learning process [26]. Unlike traditional co-training, which\nrequires splitting features into suf\ufb01cient and redundant views, COREG achieves regressor diversity by\nusing different distance metrics for its two regressors [27].\nConsistency regularization methods aim to make model outputs invariant to local input perturbations\n[17, 28, 18]. 
For semi-supervised classi\ufb01cation, [29] found that VAT and mean teacher were the best\nmethods using fair evaluation guidelines. Virtual adversarial training (VAT) via local distributional\nsmoothing (LDS) enforces consistency by training models to be robust to adversarial local input\nperturbations [17, 30]. Unlike adversarial training [31], the virtual adversarial perturbation is found\nwithout labels, making semi-supervised learning possible. We adapt VAT for regression by choosing\nthe output distribution N (h\u03b8(x), \u03c32) for input x, where h\u03b8 : Rd \u2192 R is a parameterized mapping\nand \u03c3 is \ufb01xed. Optimizing the likelihood term is then equivalent to minimizing squared error; the LDS\nterm is the KL-divergence between the model distribution and a perturbed Gaussian (see Appendix\nA.2). Mean teacher enforces consistency by penalizing deviation from the outputs of a model with\nthe exponential weighted average of the parameters over SGD iterations [18].\n\n2https://github.com/ermongroup/ssdkl\n\n5\n\n\fPercent reduction in RMSE compared to DKL\n\nn = 100\n\nDataset\nSkillcraft\nParkinsons\nElevators\nProtein\nBlog\nCTslice\nBuzz\nElectric\nMedian\n\nN\n\n3,325\n5,875\n16,599\n45,730\n52,397\n53,500\n583,250\n2,049,280\n\nd\n18\n20\n18\n9\n280\n384\n77\n6\n\nSSDKL COREG\n1.87\n-27.45\n-5.22\n-2.37\n11.15\n-12.11\n13.80\n-126.95\n-3.80\n\n3.44\n-2.51\n7.99\n-3.34\n5.65\n-22.48\n5.59\n4.96\n4.20\n\nLabel Prop\n5.12\n-43.43\n2.28\n0.77\n9.01\n-17.12\n1.33\n-201.18\n1.05\n\nVAE Mean Teacher\n-19.72\n0.11\n-91.54\n-122.23\n-27.27\n-22.68\n-5.11\n-8.65\n7.05\n8.96\n-60.71\n-47.59\n-19.26\n-77.08\n-399.85\n-285.61\n-20.97\n-43.99\n\nVAT SSDKL COREG\n0.60\n-22.50\n-25.98\n0.89\n9.60\n2.94\n10.41\n-154.07\n0.75\n\n5.97\n5.97\n6.92\n1.23\n5.34\n5.64\n11.33\n-13.93\n5.81\n\n-21.97\n-143.60\n-31.25\n-6.44\n1.87\n-64.75\n-82.66\n-513.95\n-48.00\n\nn = 300\n\nLabel Prop\n5.78\n-51.35\n-22.08\n2.61\n12.44\n-2.59\n-2.22\n-303.21\n-2.41\n\nVAE Mean 
Teacher\n-18.17\n4.36\n-132.68\n-167.93\n-82.01\n-53.40\n-8.98\n-9.24\n7.87\n8.14\n-58.97\n-60.18\n-28.65\n-104.88\n-627.83\n-460.48\n-41.02\n-70.49\n\nVAT\n-20.13\n-202.79\n-63.68\n-10.38\n9.08\n-84.60\n-100.82\n-828.35\n-74.14\n\nTable 1: Percent reduction in RMSE compared to baseline supervised deep kernel learning (DKL)\nmodel for semi-supervised deep kernel learning (SSDKL), COREG, label propagation, variational auto-\nencoder (VAE), mean teacher, and virtual adversarial training (VAT) models. Results are averaged\nacross 10 trials for each UCI regression dataset. Here N is the total number of examples, d is the\ninput feature dimension, and n is the number of labeled training examples. Final row shows median\npercent reduction in RMSE achieved by using unlabeled data.\n\nLabel propagation de\ufb01nes a graph structure over the data with edges that de\ufb01ne the probability\nfor a categorical label to propagate from one data point to another [32]. If we encode this graph\nin a transition matrix T and let the current class probabilities be y, then the algorithm iteratively\npropagates y \u2190 T y, row-normalizes y, clamps the labeled data to their known values, and repeats\nuntil convergence. We make the extension to regression by letting y be real-valued labels and\nnormalizing T . As in [32], we use a fully-connected graph and the radial-basis kernel for edge\nweights. The kernel scale hyperparameter is chosen using a validation set.\nGenerative models such as the variational autoencoder (VAE) have shown promise in semi-supervised\nclassi\ufb01cation especially for visual and sequential tasks [33, 34, 35, 36]. We compare against a\nsemi-supervised VAE by \ufb01rst learning an unsupervised embedding of the data and then using the\nembeddings as input to a supervised multilayer perceptron.\n\n4.2 UCI regression experiments\n\nWe evaluate SSDKL on eight regression datasets from the UCI repository. 
For each dataset, we train\non n = {50, 100, 200, 300, 400, 500} labeled examples, retain 1000 examples as the hold out test set,\nand treat the remaining data as unlabeled examples. Following [29], the labeled data is randomly split\n90-10 into training and validation samples, giving a realistically small validation set. For example,\nfor n = 100 labeled examples, we use 90 random examples for training and the remaining 10 for\nvalidation in every random split. We report test RMSE averaged over 10 trials of random splits to\ncombat the small data sizes. All kernel hyperparameters are optimized directly through Lsemisup, and\nwe use the validation set for early stopping to prevent over\ufb01tting and for selecting \u03b1 \u2208 {0.1, 1, 10}.\nWe did not use approximate GP procedures in our SSDKL or DKL experiments, so the only difference\nis the addition of the variance regularizer. For all combinations of input feature dimensions and\nlabeled data sizes in the UCI experiments, each SSDKL trial (including all training and testing) ran\non the order of minutes.\nFollowing [20], we choose a neural network with a similar [d-100-50-50-2] architecture and two-\ndimensional embedding. Following [29], we use this same base model for all deep models, including\nSSDKL, DKL, VAT, mean teacher, and the VAE encoder, in order to make results comparable across\nmethods. Since label propagation creates a kernel matrix of all data points, we limit the number of\nunlabeled examples for label propagation to a maximum of 20000 due to memory constraints. 
We initialize labels in label propagation with a kNN regressor with k = 5 to speed up convergence.\nTable 1 displays the results for n = 100 and n = 300; full results are included in Appendix A.3. SSDKL gives median RMSE improvements of 4.20% and 5.81% over the supervised DKL in the n = 100 and n = 300 cases respectively, superior to other semi-supervised methods adapted for regression. A Wilcoxon signed-rank test versus DKL shows significance at the p = 0.05 level for at least one labeled training set size for all 8 datasets.\n\nFigure 2: Left: Average test RMSE vs. number of labeled examples for UCI Elevators dataset, n = {50, 100, 200, 300, 400, 500}. SSDKL generally outperforms supervised DKL, co-training regressors (COREG), and virtual adversarial training (VAT). Right: SSDKL performance on poverty prediction (Section 4.3) as a function of \u03b1, which controls the trade-off between labeled and unlabeled objectives, for n = 300. The dotted lines plot the performance of DKL and COREG. All results averaged over 10 trials. In both panels, shading represents one standard deviation.\n\nThe same learning rates and initializations are used across all UCI datasets for SSDKL. We use learning rates of 1 \u00d7 10^\u22123 and 0.1 for the neural network and GP parameters respectively and initialize all GP parameters to 1. In Fig. 2 (right), we study the effect of varying \u03b1 to trade off between maximizing the likelihood of labeled data and minimizing the variance of unlabeled data. A large \u03b1 emphasizes minimization of the predictive variance while a small \u03b1 focuses on fitting labeled data. SSDKL improves on DKL for values of \u03b1 between 0.1 and 10.0, indicating that performance is not overly reliant on the choice of this hyperparameter. Fig. 2 (left) compares SSDKL to purely supervised DKL, COREG, and VAT as we vary the labeled training set size. 
For the Elevators dataset,\nDKL is able to close the gap on SSDKL as it gains access to more labeled data. Relative to the other\nmethods, which require more data to \ufb01t neural network parameters, COREG performs well in the\nlow-data regime.\nSurprisingly, COREG outperformed SSDKL on the Blog, CTslice, and Buzz datasets. We found that\nthese datasets happen to be better-suited for nearest neighbors-based methods such as COREG. A\nkNN regressor using only the labeled data outperformed DKL on two of three datasets for n = 100,\nbeat SSDKL on all three for n = 100, beat DKL on two of three for n = 300, and beat SSDKL on\none of three for n = 300. Thus, the kNN regressor is often already outperforming SSDKL with only\nlabeled data\u2014it is unsurprising that SSDKL is unable to close the gap on a semi-supervised nearest\nneighbors method like COREG.\n\nRepresentation learning To gain some intuition about how the unlabeled data helps in the learning\nprocess, we visualize the neural network embeddings learned by the DKL and SSDKL models on the\nSkillcraft dataset. In Fig. 3 (left), we \ufb01rst train DKL on n = 100 labeled training examples and plot\nthe two-dimensional neural network embedding that is learned. In Fig. 3 (right), we train SSDKL\non n = 100 labeled training examples along with m = 1000 additional unlabeled data points and\nplot the resulting embedding. In the left panel, DKL learns a poor embedding\u2014different colors\nrepresenting different output magnitudes are intermingled. In the right panel, SSDKL is able to use\nthe unlabeled data for regularization, and learns a better representation of the dataset.\n\n4.3 Poverty prediction\n\nHigh-resolution satellite imagery offers the potential for cheap, scalable, and accurate tracking of\nchanging socioeconomic indicators. In this task, we predict local poverty measures from satellite\nimages using limited amounts of poverty labels. 
As described in [2], the dataset consists of 3,066 villages across five African countries: Nigeria, Tanzania, Uganda, Malawi, and Rwanda. These include some of the poorest countries in the world (Malawi and Rwanda) as well as some that are relatively better off (Nigeria), making for a challenging and realistically diverse problem.\n\nFigure 3: Left: Two-dimensional embeddings learned by supervised deep kernel learning (DKL) model on the Skillcraft dataset using n = 100 labeled training examples. The colorbar shows the magnitude of the normalized outputs. Right: Embeddings learned by semi-supervised deep kernel learning (SSDKL) model using n = 100 labeled examples plus an additional m = 1000 unlabeled examples. By using unlabeled data for regularization, SSDKL learns a better representation.\n\nPercent reduction in RMSE (n = 300)\n\nCountry | Spatial SSDKL | SSDKL | DKL\nMalawi | 13.7 | 16.4 | 15.7\nNigeria | 17.9 | 4.6 | 1.7\nTanzania | 10.0 | 15.5 | 9.2\nUganda | 25.2 | 12.1 | 13.8\nRwanda | 27.0 | 25.4 | 21.3\nMedian | 17.9 | 15.5 | 13.8\n\nTable 2: Percent RMSE reduction in a poverty measure prediction task compared to baseline ridge regression model used in [2]. SSDKL and DKL models use only satellite image data. Spatial SSDKL incorporates both location and image data through kernel composition. Final row shows median RMSE reduction of each model averaged over 10 trials.\n\nIn this experiment, we use n = 300 labeled satellite images for training. With such a small dataset, we cannot expect to train a deep convolutional neural network (CNN) from scratch. Instead we take a transfer learning approach as in [24], extracting 4096-dimensional visual features and using these as input. More details are provided in Appendix A.5.\n\nIncorporating spatial information In order to highlight the usefulness of kernel composition, we explore extending SSDKL with a spatial kernel. 
Spatial SSDKL composes two kernels by summing\nan image feature kernel and a separate location kernel that operates on location coordinates (lat/lon).\nBy treating them separately, it explicitly encodes the knowledge that location coordinates are spatially\nstructured and distinct from image features.\nAs shown in Table 2, all models outperform the baseline state-of-the-art ridge regression model from\n[2]. Spatial SSDKL signi\ufb01cantly outperforms the DKL and SSDKL models that use only image\nfeatures. Spatial SSDKL outperforms the other models by directly modeling location coordinates\nas spatial features, showing that kernel composition can effectively incorporate prior knowledge of\nstructure.\n\n5 Related work\n\n[37] introduced deep Gaussian processes, which stack GPs in a hierarchy by modeling the outputs of\none layer with a Gaussian process in the next layer. Despite the suggestive name, these models do not\nintegrate deep neural networks and Gaussian processes.\n\n8\n\n\f[12] proposed deep kernel learning, combining neural networks with the non-parametric \ufb02exibility\nof GPs and training end-to-end in a fully supervised setting. Extensions have explored approximate\ninference, stochastic gradient training, and recurrent deep kernels for sequential data [22, 38, 39].\nOur method draws inspiration from transductive experimental design, which chooses the most\ninformative points (experiments) to measure by seeking data points that are both hard to predict and\ninformative for the unexplored test data [40]. 
Similar prediction uncertainty approaches have been\nexplored in semi-supervised classi\ufb01cation models, such as minimum entropy and minimum variance\nregularization, which can now also be understood in the posterior regularization framework [7, 41].\nRecent work in generative adversarial networks (GANs) [33], variational autoencoders (VAEs) [34],\nand other generative models has achieved promising results on various semi-supervised classi\ufb01cation\ntasks [35, 36]. However, we \ufb01nd that these models are not as well-suited for generic regression tasks\nsuch as those in the UCI repository as they are for audio-visual tasks.\nConsistency regularization posits that the model\u2019s output should be invariant to reasonable\nperturbations of the input [17, 28, 18]. Combining adversarial training [31] with consistency regularization,\nvirtual adversarial training uses a label-free regularization term that allows for semi-supervised\ntraining [17]. Mean teacher adds a regularization term that penalizes deviation from an exponentially\nweighted average of the parameters over SGD iterations [18]. For semi-supervised classi\ufb01cation, [29]\nfound that VAT and mean teacher were the best methods across a series of fair evaluations.\nLabel propagation de\ufb01nes a graph structure over the data points and propagates labels from labeled\ndata over the graph. The method must assume a graph structure and edge distances on the input\nfeature space without the ability to adapt the space to the assumptions. Label propagation is also\nsubject to memory constraints since it forms a kernel matrix over all data points, requiring quadratic\nspace in general, although sparser graph structures can reduce this to linear scaling.\nCo-training regressors (COREG) [26] trains two kNN regressors with different distance metrics that\nlabel each other\u2019s unlabeled data. 
This works when neighbors in the given input space have similar target distributions,\nbut unlike kernel learning approaches, the features are \ufb01xed. Thus, COREG cannot adapt the space to\na misspeci\ufb01ed distance measure. In addition, as a fully nonparametric method, inference requires\nretaining the full dataset.\nMuch of the previous work in semi-supervised learning is in classi\ufb01cation and the assumptions do\nnot generally translate to regression. Our experiments show that SSDKL outperforms other adapted\nsemi-supervised methods in a battery of regression tasks.\n\n6 Conclusions\n\nMany important problems are challenging because of the limited availability of training data, making\nthe ability to learn from unlabeled data critical. In experiments with UCI datasets and a real-world\npoverty prediction task, we \ufb01nd that minimizing posterior variance can be an effective way to\nincorporate unlabeled data when labeled training data is scarce. SSDKL models are naturally suited\nfor many real-world problems, as spatial and temporal structure can be explicitly modeled through the\ncomposition of kernel functions. While our focus is on regression problems, we believe the SSDKL\nframework is equally applicable to classi\ufb01cation problems\u2014we leave this to future work.\n\nAcknowledgements\n\nThis research was supported by NSF (#1651565, #1522054, #1733686), ONR, Sony, and FLI. NJ was\nsupported by the Department of Defense (DoD) through the National Defense Science & Engineering\nGraduate Fellowship (NDSEG) Program. We are thankful to Volodymyr Kuleshov and Aditya Grover\nfor helpful discussions.\n\nReferences\n[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep\nconvolutional neural networks. 
In Advances in neural information processing systems, pages\n1097\u20131105, 2012.\n\n[2] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon.\nCombining satellite imagery and machine learning to predict poverty. Science, 353(6301):790\u2013794,\n2016.\n\n[3] Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. Deep gaussian\nprocess for crop yield prediction based on remote sensing data. In AAAI, pages 4559\u20134566,\n2017.\n\n[4] Barak Oshri, Annie Hu, Peter Adelson, Xiao Chen, Pascaline Dupas, Jeremy Weinstein, Marshall\nBurke, David Lobell, and Stefano Ermon. Infrastructure quality assessment in africa using\nsatellite imagery and deep learning. Proc. 24th ACM SIGKDD Conference, 2018.\n\n[5] Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis\nlectures on arti\ufb01cial intelligence and machine learning, 3(1):1\u2013130, 2009.\n\n[6] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised\ndomain adaptation. In International Conference on Learning Representations, 2018.\n\n[7] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In\nAdvances in neural information processing systems, pages 529\u2013536, 2004.\n\n[8] Olivier Chapelle and Alexander Zien. Semi-supervised classi\ufb01cation by low density separation.\nIn AISTATS, pages 57\u201364, 2005.\n\n[9] Aarti Singh, Robert Nowak, and Xiaojin Zhu. Unlabeled data: Now it helps, now it doesn\u2019t. In\nAdvances in neural information processing systems, pages 1513\u20131520, 2009.\n\n[10] Volodymyr Kuleshov and Stefano Ermon. Deep hybrid models: Bridging discriminative and\ngenerative approaches. In Proceedings of the Conference on Uncertainty in AI (UAI), 2017.\n\n[11] Hongyu Ren, Russell Stewart, Jiaming Song, Volodymyr Kuleshov, and Stefano Ermon.\nAdversarial constraint learning for structured prediction. Proc. 
27th International Joint Conference\non Arti\ufb01cial Intelligence, 2018.\n\n[12] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel\nlearning. The Journal of Machine Learning Research, 2015.\n\n[13] Stephan Eissman and Stefano Ermon. Bayesian optimization and attribute adjustment. Proc.\n34th Conference on Uncertainty in Arti\ufb01cial Intelligence, 2018.\n\n[14] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior regularization for structured\nlatent variable models. Journal of Machine Learning Research, 11(Jul):2001\u20132049, 2010.\n\n[15] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and\napplications to in\ufb01nite latent svms. Journal of Machine Learning Research, 15(1):1799\u20131847,\n2014.\n\n[16] Rui Shu, Hung H Bui, Shengjia Zhao, Mykel J Kochenderfer, and Stefano Ermon. Amortized\ninference regularization. NIPS, 2018.\n\n[17] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional\nsmoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.\n\n[18] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged\nconsistency targets improve semi-supervised deep learning results. In I. Guyon, U. V. Luxburg,\nS. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural\nInformation Processing Systems 30, pages 1195\u20131204. Curran Associates, Inc., 2017.\n\n[19] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for\ntransductive transfer learning. Proc. Seventh IEEE Int\u2019l Conf. Data Mining Workshops, 2007.\n\n[20] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel\nlearning. 
In Proceedings of the 19th International Conference on Arti\ufb01cial Intelligence and\nStatistics, pages 370\u2013378, 2016.\n\n[21] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning.\nThe MIT Press, 2006.\n\n[22] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured\ngaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine\nLearning, ICML 2015, Lille, France, 6-11 July 2015, pages 1775\u20131784, 2015.\n\n[23] M. Lichman. UCI machine learning repository, 2013.\n\n[24] Michael Xie, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon. Transfer learning\nfrom deep features for remote sensing and poverty mapping. AAAI Conference on Arti\ufb01cial\nIntelligence, 2016.\n\n[25] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,\nGreg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensor\ufb02ow: Large-scale\nmachine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467,\n2016.\n\n[26] Zhi-Hua Zhou and Ming Li. Semi-supervised regression with co-training. In IJCAI, volume 5,\npages 908\u2013913, 2005.\n\n[27] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In\nProceedings of the eleventh annual conference on Computational learning theory, pages 92\u2013100.\nACM, 1998.\n\n[28] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. ICLR 2017.\n\n[29] Augustus Odena, Avital Oliver, Colin Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic\nevaluation of semi-supervised learning algorithms. 2018.\n\n[30] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial\ntraining: a regularization method for supervised and semi-supervised learning. arXiv preprint\narXiv:1704.03976, 2017.\n\n[31] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 
Explaining and harnessing adversarial\nexamples. arXiv preprint arXiv:1412.6572, 2014.\n\n[32] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label\npropagation. Technical report, 2002.\n\n[33] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural\ninformation processing systems, pages 2672\u20132680, 2014.\n\n[34] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\narXiv:1312.6114, 2013.\n\n[35] Lars Maal\u00f8e, Casper Kaae S\u00f8nderby, S\u00f8ren Kaae S\u00f8nderby, and Ole Winther. Auxiliary deep\ngenerative models. arXiv preprint arXiv:1602.05473, 2016.\n\n[36] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko.\nSemi-supervised learning with ladder networks. In Advances in Neural Information Processing\nSystems, pages 3546\u20133554, 2015.\n\n[37] Andreas C. Damianou and Neil D. Lawrence. Deep gaussian processes. The Journal of Machine\nLearning Research, 2013.\n\n[38] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational\ndeep kernel learning. In Advances in Neural Information Processing Systems, pages 2586\u20132594,\n2016.\n\n[39] Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, and Eric P Xing.\nLearning scalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936, 2016.\n\n[40] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. The\nInternational Conference on Machine Learning (ICML), 2006.\n\n[41] Chenyang Zhao and Shaodan Zhai. Minimum variance semi-supervised boosting for\nmulti-label classi\ufb01cation. In 2015 IEEE Global Conference on Signal and Information Processing\n(GlobalSIP), pages 1342\u20131346. 
IEEE, 2015.", "award": [], "sourceid": 2545, "authors": [{"given_name": "Neal", "family_name": "Jean", "institution": "Stanford University"}, {"given_name": "Sang Michael", "family_name": "Xie", "institution": "Stanford University"}, {"given_name": "Stefano", "family_name": "Ermon", "institution": "Stanford"}]}