{"title": "Semi-supervised Learning with Deep Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3581, "page_last": 3589, "abstract": "The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.", "full_text": "Semi-supervised Learning with\n\nDeep Generative Models\n\nDiederik P. Kingma\u2217, Danilo J. Rezende\u2020, Shakir Mohamed\u2020, Max Welling\u2217\n\n\u2217Machine Learning Group, Univ. of Amsterdam, {D.P.Kingma, M.Welling}@uva.nl\n\n\u2020Google Deepmind, {danilor, shakir}@google.com\n\nAbstract\n\nThe ever-increasing size of modern data sets combined with the dif\ufb01culty of ob-\ntaining label information has made semi-supervised learning one of the problems\nof signi\ufb01cant practical importance in modern data analysis. We revisit the ap-\nproach to semi-supervised learning with generative models and develop new mod-\nels that allow for effective generalisation from small labelled data sets to large\nunlabelled ones. Generative approaches have thus far been either in\ufb02exible, in-\nef\ufb01cient or non-scalable. 
We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.\n\n1 Introduction\n\nSemi-supervised learning considers the problem of classification when only a small subset of the observations have corresponding class labels. Such problems are of immense practical interest in a wide range of applications, including image search (Fergus et al., 2009), genomics (Shi and Zhang, 2011), natural language parsing (Liang, 2005), and speech analysis (Liu and Kirchhoff, 2013), where unlabelled data is abundant, but class labels are expensive or impossible to obtain for the entire data set. The question that is then asked is: how can properties of the data be used to improve decision boundaries and to allow for classification that is more accurate than that based on classifiers constructed using the labelled data alone? In this paper we answer this question by developing probabilistic models for inductive and transductive semi-supervised learning by utilising an explicit model of the data density, building upon recent advances in deep generative models and scalable variational inference (Kingma and Welling, 2014; Rezende et al., 2014).\nAmongst existing approaches, the simplest algorithm for semi-supervised learning is based on a self-training scheme (Rosenberg et al., 2005) where the model is bootstrapped with additional labelled data obtained from its own highly confident predictions; this process is repeated until some termination condition is reached. These methods are heuristic and prone to error since they can reinforce poor predictions. Transductive SVMs (TSVM) (Joachims, 1999) extend SVMs with the aim of max-margin classification while ensuring that there are as few unlabelled observations near the margin as possible. 
These approaches have difficulty extending to large amounts of unlabelled data, and efficient optimisation in this setting is still an open problem. Graph-based methods are amongst the most popular and aim to construct a graph connecting similar observations; label information propagates through the graph from labelled to unlabelled nodes by finding the minimum energy (MAP) configuration (Blum et al., 2004; Zhu et al., 2003). Graph-based approaches are sensitive to the graph structure and require eigen-analysis of the graph Laplacian, which limits the scale to which these methods can be applied \u2013 though efficient spectral methods are now available (Fergus et al., 2009).\n\nFor an updated version of this paper, please see http://arxiv.org/abs/1406.5298\n\nNeural network-based approaches combine unsupervised and supervised learning by training feed-forward classifiers with an additional penalty from an auto-encoder or other unsupervised embedding of the data (Ranzato and Szummer, 2008; Weston et al., 2012). The Manifold Tangent Classifier (MTC) (Rifai et al., 2011) trains contractive auto-encoders (CAEs) to learn the manifold on which the data lies, followed by an instance of TangentProp to train a classifier that is approximately invariant to local perturbations along the manifold. The idea of manifold learning using graph-based methods has most recently been combined with kernel (SVM) methods in the Atlas RBF model (Pitelis et al., 2014) and provides amongst the most competitive performance currently available.\nIn this paper, we instead choose to exploit the power of generative models, which recognise the semi-supervised learning problem as a specialised missing data imputation task for the classification problem. 
Existing generative approaches based on models such as Gaussian mixture or hidden Markov models (Zhu, 2006) have not been very successful due to the need for a large number of mixture components or states to perform well. More recent solutions have used non-parametric density models, either based on trees (Kemp et al., 2003) or Gaussian processes (Adams and Ghahramani, 2009), but scalability and accurate inference for these approaches is still lacking. Variational approximations for semi-supervised clustering have also been explored previously (Li et al., 2009; Wang et al., 2009).\nThus, while a small set of generative approaches have been previously explored, a generalised and scalable probabilistic approach for semi-supervised learning is still lacking. It is this gap that we address through the following contributions:\n\n\u2022 We describe a new framework for semi-supervised learning with generative models, employing rich parametric density estimators formed by the fusion of probabilistic modelling and deep neural networks.\n\u2022 We show for the first time how variational inference can be brought to bear upon the problem of semi-supervised classification. In particular, we develop a stochastic variational inference algorithm that allows for joint optimisation of both model and variational parameters, and that is scalable to large datasets.\n\u2022 We demonstrate the performance of our approach on a number of data sets, providing state-of-the-art results on benchmark problems.\n\u2022 We show qualitatively that generative semi-supervised models learn to separate the data classes (content types) from the intra-class variabilities (styles), allowing us, in a very straightforward fashion, to simulate analogies of images on a variety of datasets.\n\n2 Deep Generative Models for Semi-supervised Learning\nWe are faced with data that appear as pairs (X, Y) = {(x1, y1), . . . 
, (xN , yN )}, with the i-th observation xi \u2208 RD and the corresponding class label yi \u2208 {1, . . . , L}. Observations will have corresponding latent variables, which we denote by zi. We will omit the index i whenever it is clear that we are referring to terms associated with a single data point. In semi-supervised classification, only a subset of the observations have corresponding class labels; we refer to the empirical distribution over the labelled and unlabelled subsets as p\u0303l(x, y) and p\u0303u(x), respectively. We now develop models for semi-supervised learning that exploit generative descriptions of the data to improve upon the classification performance that would be obtained using the labelled data alone.\nLatent-feature discriminative model (M1): A commonly used approach is to construct a model that provides an embedding or feature representation of the data. Using these features, a separate classifier is thereafter trained. The embeddings allow for a clustering of related observations in a latent feature space that allows for accurate classification, even with a limited number of labels. Instead of a linear embedding, or features obtained from a regular auto-encoder, we construct a deep generative model of the data that is able to provide a more robust set of latent features. The generative model we use is:\n\np(z) = N (z|0, I);   p\u03b8(x|z) = f (x; z, \u03b8),   (1)\n\nwhere f (x; z, \u03b8) is a suitable likelihood function (e.g., a Gaussian or Bernoulli distribution) whose probabilities are formed by a non-linear transformation, with parameters \u03b8, of a set of latent variables z. 
This non-linear transformation is essential to allow for higher moments of the data to be captured by the density model, and we choose these non-linear functions to be deep neural networks.\nApproximate samples from the posterior distribution over the latent variables p(z|x) are used as features to train a classifier that predicts class labels y, such as a (transductive) SVM or multinomial regression. Using this approach, we can now perform classification in a lower dimensional space since we typically use latent variables whose dimensionality is much less than that of the observations. These low dimensional embeddings should now also be more easily separable since we make use of independent latent Gaussian posteriors whose parameters are formed by a sequence of non-linear transformations of the data. This simple approach results in improved performance for SVMs, and we demonstrate this in section 4.\nGenerative semi-supervised model (M2): We propose a probabilistic model that describes the data as being generated by a latent class variable y in addition to a continuous latent variable z. The data is explained by the generative process:\n\np(y) = Cat(y|\u03c0);   p(z) = N (z|0, I);   p\u03b8(x|y, z) = f (x; y, z, \u03b8),   (2)\n\nwhere Cat(y|\u03c0) is the multinomial distribution, the class labels y are treated as latent variables if no class label is available and z are additional latent variables. These latent variables are marginally independent and allow us, in the case of digit generation for example, to separate the class specification from the writing style of the digit. As before, f (x; y, z, \u03b8) is a suitable likelihood function, e.g., a Bernoulli or Gaussian distribution, parameterised by a non-linear transformation of the latent variables. In our experiments, we choose deep neural networks as this non-linear function. 
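As an editorial illustration of the generative process in equation (2), the sketch below samples y from Cat(y|\u03c0), z from N (z|0, I), and then x from a Bernoulli likelihood whose probabilities come from a small MLP standing in for f (x; y, z, \u03b8). All names, layer sizes and the randomly initialised parameter values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_decoder(y_onehot, z, W1, b1, W2, b2):
    # One-hidden-layer MLP standing in for f(x; y, z, theta): maps the
    # concatenated (y, z) to Bernoulli probabilities for the pixels of x.
    h = np.concatenate([y_onehot, z])
    h = np.log1p(np.exp(W1 @ h + b1))             # softplus hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # sigmoid output probabilities

def sample_x(pi, theta, L=10, dz=50):
    W1, b1, W2, b2 = theta
    y = rng.choice(L, p=pi)                       # y ~ Cat(y | pi)
    z = rng.standard_normal(dz)                   # z ~ N(z | 0, I)
    probs = mlp_decoder(np.eye(L)[y], z, W1, b1, W2, b2)
    x = rng.binomial(1, probs)                    # x ~ Bernoulli(f(x; y, z, theta))
    return x, y, z

# Untrained parameters with small random weights, for illustration only.
L_cls, dz, dh, D = 10, 50, 500, 784
theta = (0.01 * rng.standard_normal((dh, L_cls + dz)), np.zeros(dh),
         0.01 * rng.standard_normal((D, dh)), np.zeros(D))
x, y, z = sample_x(np.full(L_cls, 0.1), theta, L=L_cls, dz=dz)
```

With trained parameters, holding z fixed while varying y yields the style-preserving analogies discussed in section 4.2.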
Since\nmost labels y are unobserved, we integrate over the class of any unlabelled data during the infer-\nence process, thus performing classi\ufb01cation as inference. Predictions for any missing labels are\nobtained from the inferred posterior distribution p\u03b8(y|x). This model can also be seen as a hybrid\ncontinuous-discrete mixture model where the different mixture components share parameters.\nStacked generative semi-supervised model (M1+M2): We can combine these two approaches by\n\ufb01rst learning a new latent representation z1 using the generative model from M1, and subsequently\nlearning a generative semi-supervised model M2, using embeddings from z1 instead of the raw data\nx. The result is a deep generative model with two layers of stochastic variables: p\u03b8(x, y, z1, z2) =\np(y)p(z2)p\u03b8(z1|y, z2)p\u03b8(x|z1), where the priors p(y) and p(z2) equal those of y and z above, and\nboth p\u03b8(z1|y, z2) and p\u03b8(x|z1) are parameterised as deep neural networks.\n3 Scalable Variational Inference\n3.1 Lower Bound Objective\nIn all our models, computation of the exact posterior distribution is intractable due to the nonlinear,\nnon-conjugate dependencies between the random variables. To allow for tractable and scalable\ninference and parameter learning, we exploit recent advances in variational inference (Kingma and\nWelling, 2014; Rezende et al., 2014). For all the models described, we introduce a \ufb01xed-form\ndistribution q\u03c6(z|x) with parameters \u03c6 that approximates the true posterior distribution p(z|x). 
We\nthen follow the variational principle to derive a lower bound on the marginal likelihood of the model\n\u2013 this bound forms our objective function and ensures that our approximate posterior is as close as\npossible to the true posterior.\nWe construct the approximate posterior distribution q\u03c6(\u00b7) as an inference or recognition model,\nwhich has become a popular approach for ef\ufb01cient variational inference (Dayan, 2000; Kingma and\nWelling, 2014; Rezende et al., 2014; Stuhlm\u00a8uller et al., 2013). Using an inference network, we avoid\nthe need to compute per data point variational parameters, but can instead compute a set of global\nvariational parameters \u03c6. This allows us to amortise the cost of inference by generalising between\nthe posterior estimates for all latent variables through the parameters of the inference network, and\nallows for fast inference at both training and testing time (unlike with VEM, in which we repeat\nthe generalized E-step optimisation for every test data point). An inference network is introduced\nfor all latent variables, and we parameterise them as deep neural networks whose outputs form the\nparameters of the distribution q\u03c6(\u00b7). For the latent-feature discriminative model (M1), we use a\nGaussian inference network q\u03c6(z|x) for the latent variable z. 
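A recognition model of this kind admits a compact editorial sketch: a Gaussian network for q\u03c6(z|x) sampled with the location-scale trick, and, for M2, a categorical network for q\u03c6(y|x) given by a softmax. The linear maps below stand in for the MLPs \u00b5\u03c6, \u03c3\u03c6 and \u03c0\u03c6; all shapes and parameter values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_inference_net(x, W, b_mu, b_logvar):
    # Linear stand-ins for the MLPs that output the mean and log-variance of
    # q_phi(z | x); a sample is drawn with the location-scale reparameterisation.
    h = W @ x
    mu, logvar = h + b_mu, h + b_logvar
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
    z = mu + sigma * eps                  # z ~ N(mu(x), diag(sigma^2(x)))
    return z, mu, sigma

def categorical_inference_net(x, V, c):
    # Outputs the probability vector pi_phi(x) of q_phi(y | x) via a softmax.
    logits = V @ x + c
    logits = logits - logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

D, dz, L = 784, 50, 10
x = rng.random(D)
z, mu, sigma = gaussian_inference_net(
    x, 0.01 * rng.standard_normal((dz, D)), np.zeros(dz), np.zeros(dz))
pi = categorical_inference_net(x, 0.01 * rng.standard_normal((L, D)), np.zeros(L))
```

Because the same weights serve every data point, inference for a new x is a single forward pass, which is the amortisation property described above.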
For the generative semi-supervised model (M2), we introduce an inference model for each of the latent variables z and y, which we assume has a factorised form q\u03c6(z, y|x) = q\u03c6(z|y, x)q\u03c6(y|x), specified as Gaussian and multinomial distributions respectively:\n\nM1: q\u03c6(z|x) = N (z|\u00b5\u03c6(x), diag(\u03c3\u00b2\u03c6(x))),   (3)\nM2: q\u03c6(z|y, x) = N (z|\u00b5\u03c6(y, x), diag(\u03c3\u00b2\u03c6(x)));   q\u03c6(y|x) = Cat(y|\u03c0\u03c6(x)),   (4)\n\nwhere \u03c3\u03c6(x) is a vector of standard deviations, \u03c0\u03c6(x) is a probability vector, and the functions \u00b5\u03c6(x), \u03c3\u03c6(x) and \u03c0\u03c6(x) are represented as MLPs.\n\n3.1.1 Latent Feature Discriminative Model Objective\nFor this model, the variational bound J(x) on the marginal likelihood for a single data point is:\n\nlog p\u03b8(x) \u2265 Eq\u03c6(z|x) [log p\u03b8(x|z)] \u2212 KL[q\u03c6(z|x) \u2225 p\u03b8(z)] = \u2212J(x).   (5)\n\nThe inference network q\u03c6(z|x) (3) is used during training of the model using both the labelled and unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled data set, and the features used for training the classifier.\n\n3.1.2 Generative Semi-supervised Model Objective\n\nFor this model, we have two cases to consider. 
In the first case, the label corresponding to a data point is observed and the variational bound is a simple extension of equation (5):\n\nlog p\u03b8(x, y) \u2265 Eq\u03c6(z|x,y) [log p\u03b8(x|y, z) + log p\u03b8(y) + log p(z) \u2212 log q\u03c6(z|x, y)] = \u2212L(x, y).   (6)\n\nFor the case where the label is missing, it is treated as a latent variable over which we perform posterior inference and the resulting bound for handling data points with an unobserved label y is:\n\nlog p\u03b8(x) \u2265 Eq\u03c6(y,z|x) [log p\u03b8(x|y, z) + log p\u03b8(y) + log p(z) \u2212 log q\u03c6(y, z|x)]\n= \u03a3_y q\u03c6(y|x)(\u2212L(x, y)) + H(q\u03c6(y|x)) = \u2212U(x).   (7)\n\nThe bound on the marginal likelihood for the entire dataset is now:\n\nJ = \u03a3_{(x,y)\u223cp\u0303l} L(x, y) + \u03a3_{x\u223cp\u0303u} U(x).   (8)\n\nThe distribution q\u03c6(y|x) (4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model. This distribution is also used at test time for predictions of any unseen data.\nIn the objective function (8), the label predictive distribution q\u03c6(y|x) contributes only to the second term relating to the unlabelled data, which is an undesirable property if we wish to use this distribution as a classifier. Ideally, all model and variational parameters should learn in all cases. To remedy this, we add a classification loss to (8), such that the distribution q\u03c6(y|x) also learns from labelled data. The extended objective function is:\n\nJ^\u03b1 = J + \u03b1 \u00b7 E_{p\u0303l(x,y)} [\u2212log q\u03c6(y|x)],   (9)\n\nwhere the hyper-parameter \u03b1 controls the relative weight between generative and purely discriminative learning. We use \u03b1 = 0.1 \u00b7 N in all experiments. 
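The way the terms of equations (6)-(9) combine can be checked with a toy numerical sketch (an editorial illustration, not the paper's code). It assumes precomputed toy values for the labelled bounds L(x, y), which in the actual model come from equation (6), and forms U(x) and the extended objective J^\u03b1 from them:

```python
import numpy as np

def U(L_per_class, q_y):
    # Eq. (7): U(x) = sum_y q(y|x) L(x, y) - H(q(y|x)); L(x, y) is the
    # labelled bound of eq. (6), evaluated for each possible label y.
    entropy = -np.sum(q_y * np.log(q_y))
    return np.sum(q_y * L_per_class) - entropy

def J_alpha(L_lab, q_lab, y_lab, L_unl, q_unl, alpha):
    # Eq. (8): J = sum of L over labelled pairs plus sum of U over unlabelled
    # points; eq. (9) adds the classification loss alpha * E[-log q(y|x)].
    J = sum(L_lab) + sum(U(Lc, q) for Lc, q in zip(L_unl, q_unl))
    class_loss = sum(-np.log(q[y]) for q, y in zip(q_lab, y_lab))
    return J + alpha * class_loss

# Toy values: two labelled and one unlabelled example, three classes.
L_lab = [10.0, 12.0]                              # negative-log-likelihood bounds
q_lab = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
y_lab = [0, 1]
L_unl = [np.array([11.0, 9.0, 13.0])]             # L(x, y) for each possible y
q_unl = [np.array([0.2, 0.6, 0.2])]
total = J_alpha(L_lab, q_lab, y_lab, L_unl, q_unl, alpha=0.1)
```

Raising \u03b1 increases the weight of the discriminative term \u2212log q\u03c6(y|x) relative to the generative bounds.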
While we have obtained this objective function by motivating the need for all model components to learn at all times, the objective (9) can also be derived directly using the variational principle by instead performing inference over the parameters \u03c0 of the categorical distribution, using a symmetric Dirichlet prior over these parameters.\n\n3.2 Optimisation\n\nThe bounds in equations (5) and (9) provide a unified objective function for optimisation of both the parameters \u03b8 and \u03c6 of the generative and inference models, respectively. This optimisation can be done jointly, without resort to the variational EM algorithm, by using deterministic reparameterisations of the expectations in the objective function, combined with Monte Carlo approximation \u2013 referred to in previous work as stochastic gradient variational Bayes (SGVB) (Kingma and Welling, 2014) or as stochastic backpropagation (Rezende et al., 2014). We describe the core strategy for the latent-feature discriminative model M1, since the same computations are used for the generative semi-supervised model.\nWhen the prior p(z) is a spherical Gaussian distribution p(z) = N (z|0, I) and the variational distribution q\u03c6(z|x) is also a Gaussian distribution as in (3), the KL term in equation (5) can be computed\n\nAlgorithm 1 Learning in model M1\n\nwhile generativeTraining() do\nD \u2190 getRandomMiniBatch()\nzi \u223c q\u03c6(zi|xi) \u2200xi \u2208 D\nJ \u2190 \u03a3_n J(xi)\n(g\u03b8, g\u03c6) \u2190 (\u2202J/\u2202\u03b8, \u2202J/\u2202\u03c6)\n(\u03b8, \u03c6) \u2190 (\u03b8, \u03c6) + \u0393(g\u03b8, g\u03c6)\nend while\nwhile discriminativeTraining() do\nD \u2190 getLabeledRandomMiniBatch()\nzi \u223c q\u03c6(zi|xi) \u2200{xi, yi} \u2208 D\ntrainClassifier({zi, yi})\nend while\n\nAlgorithm 2 Learning in model M2\n\nwhile training() do\nD \u2190 getRandomMiniBatch()\nyi \u223c q\u03c6(yi|xi) \u2200{xi, yi} \u2209 O\nzi \u223c 
q\u03c6(zi|yi, xi)\nJ^\u03b1 \u2190 eq. (9)\n(g\u03b8, g\u03c6) \u2190 (\u2202J^\u03b1/\u2202\u03b8, \u2202J^\u03b1/\u2202\u03c6)\n(\u03b8, \u03c6) \u2190 (\u03b8, \u03c6) + \u0393(g\u03b8, g\u03c6)\nend while\n\nanalytically and the log-likelihood term can be rewritten, using the location-scale transformation for the Gaussian distribution, as:\n\nEq\u03c6(z|x) [log p\u03b8(x|z)] = EN(\u03b5|0,I) [log p\u03b8(x|\u00b5\u03c6(x) + \u03c3\u03c6(x) \u2299 \u03b5)],   (10)\n\nwhere \u2299 indicates the element-wise product. While the expectation (10) still cannot be solved analytically, its gradients with respect to the generative parameters \u03b8 and variational parameters \u03c6 can be efficiently computed as expectations of simple gradients:\n\n\u2207{\u03b8,\u03c6} Eq\u03c6(z|x) [log p\u03b8(x|z)] = EN(\u03b5|0,I) [\u2207{\u03b8,\u03c6} log p\u03b8(x|\u00b5\u03c6(x) + \u03c3\u03c6(x) \u2299 \u03b5)].   (11)\n\nThe gradients of the loss (9) for model M2 can be computed by a direct application of the chain rule and by noting that the conditional bound L(xn, y) contains the same type of terms as the loss (9). The gradients of the latter can then be efficiently estimated using (11).\nDuring optimization we use the estimated gradients in conjunction with standard stochastic gradient-based optimization methods such as SGD, RMSprop or AdaGrad (Duchi et al., 2010). This results in parameter updates of the form: (\u03b8t+1, \u03c6t+1) \u2190 (\u03b8t, \u03c6t) + \u0393t(gt\u03b8, gt\u03c6), where \u0393 is a diagonal preconditioning matrix that adaptively scales the gradients for faster minimization. The training procedures for models M1 and M2 are summarised in algorithms 1 and 2, respectively. Our experimental
Our experimental\nresults were obtained using AdaGrad.\n\n\u03b8, gt\n\n3.3 Computational Complexity\nThe overall algorithmic complexity of a single joint update of the parameters (\u03b8, \u03c6) for M1 using the\nestimator (11) is CM1 = M SCMLP where M is the minibatch size used , S is the number of samples\nof the random variate \u0001, and CMLP is the cost of an evaluation of the MLPs in the conditional\ndistributions p\u03b8(x|z) and q\u03c6(z|x). The cost CMLP is of the form O(KD2) where K is the total\nnumber of layers and D is the average dimension of the layers of the MLPs in the model. Training\nM1 also requires training a supervised classi\ufb01er, whose algorithmic complexity, if it is a neural net,\nit will have a complexity of the form CMLP .\nThe algorithmic complexity for M2 is of the form CM2 = LCM1, where L is the number of labels\nand CM1 is the cost of evaluating the gradients of each conditional bound Jy(x), which is the same\nas for M1. The stacked generative semi-supervised model has an algorithmic complexity of the\nform CM1 + CM2. But with the advantage that the cost CM2 is calculated in a low-dimensional space\n(formed by the latent variables of the model M1 that provides the embeddings).\nThese complexities make this approach extremely appealing, since they are no more expensive than\nalternative approaches based on auto-encoder or neural models, which have the lowest computa-\ntional complexity amongst existing competitive approaches. 
In addition, our models are fully probabilistic, allowing for a wide range of inferential queries, which is not possible with many alternative approaches for semi-supervised learning.\n\nTable 1: Benchmark results of semi-supervised classification on MNIST with few labels (test error %).\n\nN | NN | CNN | TSVM | CAE | MTC | AtlasRBF | M1+TSVM | M2 | M1+M2\n100 | 25.81 | 22.98 | 16.81 | 13.47 | 12.03 | 8.10 (\u00b1 0.95) | 11.82 (\u00b1 0.25) | 11.97 (\u00b1 1.71) | 3.33 (\u00b1 0.14)\n600 | 11.44 | 7.68 | 6.16 | 6.3 | 5.13 | \u2013 | 5.72 (\u00b1 0.049) | 4.94 (\u00b1 0.13) | 2.59 (\u00b1 0.05)\n1000 | 10.7 | 6.45 | 5.38 | 4.77 | 3.64 | 3.68 (\u00b1 0.12) | 4.24 (\u00b1 0.07) | 3.60 (\u00b1 0.56) | 2.40 (\u00b1 0.02)\n3000 | 6.04 | 3.35 | 3.45 | 3.22 | 2.57 | \u2013 | 3.49 (\u00b1 0.04) | 3.92 (\u00b1 0.63) | 2.18 (\u00b1 0.04)\n\n4 Experimental Results\n\nOpen source code, with which the most important results and figures can be reproduced, is available at http://github.com/dpkingma/nips14-ssl. For the latest experimental results, please see http://arxiv.org/abs/1406.5298.\n\n4.1 Benchmark Classification\nWe test performance on the standard MNIST digit classification benchmark. The data set for semi-supervised learning is created by splitting the 50,000 training points between a labelled and unlabelled set, and varying the size of the labelled set from 100 to 3000. We ensure that all classes are balanced when doing this, i.e. each class has the same number of labelled points. We create a number of data sets using randomised sampling to obtain confidence bounds for the mean performance under repeated draws of data sets.\nFor model M1 we used a 50-dimensional latent variable z. The MLPs that form part of the generative and inference models were constructed with two hidden layers, each with 600 hidden units, using softplus log(1 + e^x) activation functions. On top, a transductive SVM (TSVM) was learned on values of z inferred with q\u03c6(z|x). 
For model M2 we also used 50-dimensional z. In each experiment, the\nMLPs were constructed with one hidden layer, each with 500 hidden units and softplus activation\nfunctions. In case of SVHN and NORB, we found it helpful to pre-process the data with PCA.\nThis makes the model one level deeper, and still optimizes a lower bound on the likelihood of the\nunprocessed data.\nTable 1 shows classi\ufb01cation results. We compare to a broad range of existing solutions in semi-\nsupervised learning, in particular to classi\ufb01cation using nearest neighbours (NN), support vector\nmachines on the labelled set (SVM), the transductive SVM (TSVM), and contractive auto-encoders\n(CAE). Some of the best results currently are obtained by the manifold tangent classi\ufb01er (MTC)\n(Rifai et al., 2011) and the AtlasRBF method (Pitelis et al., 2014). Unlike the other models in this\ncomparison, our models are fully probabilistic but have a cost in the same order as these alternatives.\n\nResults: The latent-feature discriminative model (M1) performs better than other models based\non simple embeddings of the data, demonstrating the effectiveness of the latent space in providing\nrobust features that allow for easier classi\ufb01cation. By combining these features with a classi\ufb01cation\nmechanism directly in the same model, as in the conditional generative model (M2), we are able to\nget similar results without a separate TSVM classi\ufb01er.\nHowever, by far the best results were obtained using the stack of models M1 and M2. This com-\nbined model provides accurate test-set predictions across all conditions, and easily outperforms the\npreviously best methods. 
We also tested this deep generative model for supervised learning with all available labels, and obtained a test-set performance of 0.96%, which is among the best published results for this permutation-invariant MNIST classification task.\n\n4.2 Conditional Generation\nThe conditional generative model can be used to explore the underlying structure of the data, which we demonstrate through two forms of analogical reasoning. Firstly, we demonstrate style and content separation by fixing the class label y, and then varying the latent variables z over a range of values. Figure 1 shows three MNIST classes generated from a trained model with two latent variables, with the 2D latent variable varied over a range from -5 to 5. In all cases, we see that nearby regions of latent space correspond to similar writing styles, independent of the class; the left region represents upright writing styles, while the right side represents slanted styles.\nAs a second approach, we use a test image and pass it through the inference network to infer a value of the latent variables corresponding to that image. We then fix the latent variables z to this\n\n(a) Handwriting styles for MNIST obtained by fixing the class label and varying the 2D latent variable z; (b) MNIST analogies; (c) SVHN analogies\n\nFigure 1: (a) Visualisation of handwriting styles learned by the model with 2D z-space. (b,c) Analogical reasoning with generative semi-supervised models using a high-dimensional z-space. The leftmost columns show images from the test set. The other columns show analogical fantasies of x by the generative model, where the latent variable z of each row is set to the value inferred from the test-set image on the left by the inference network. 
Each column corresponds to a class label y.\n\nvalue, vary the class label y, and simulate images from the generative model corresponding to that combination of z and y. This again demonstrates the disentanglement of style from class. Figure 1 shows these analogical fantasies for the MNIST and SVHN datasets (Netzer et al., 2011). The SVHN data set is a far more complex data set than MNIST, but the model is able to fix the style of house number and vary the digit that appears in that style well. These generations represent the best current performance in simulation from generative models on these data sets.\nThe model used in this way also provides an alternative model to the stochastic feed-forward networks (SFNN) described by Tang and Salakhutdinov (2013). The performance of our model significantly improves on SFNN, since instead of an inefficient Monte Carlo EM algorithm relying on importance sampling, we are able to perform efficient joint inference that is easy to scale.\n\nTable 2: Semi-supervised classification on the SVHN dataset with 1000 labels (test error %).\n\nKNN | TSVM | M1+KNN | M1+TSVM | M1+M2\n77.93 (\u00b1 0.08) | 66.55 (\u00b1 0.10) | 65.63 (\u00b1 0.15) | 54.33 (\u00b1 0.11) | 36.02 (\u00b1 0.10)\n\nTable 3: Semi-supervised classification on the NORB dataset with 1000 labels (test error %).\n\nKNN | TSVM | M1+KNN | M1+TSVM\n78.71 (\u00b1 0.02) | 26.00 (\u00b1 0.06) | 65.39 (\u00b1 0.09) | 18.79 (\u00b1 0.05)\n\n4.3 Image Classification\nWe demonstrate the performance of image classification on the SVHN and NORB image data sets. Since no comparative results in the semi-supervised setting exist, we perform nearest-neighbour and TSVM classification with RBF kernels and compare performance on features generated by our latent-feature discriminative model to the original features. 
The results are presented in tables 2 and 3, and we again demonstrate the effectiveness of our approach for semi-supervised classification.\n\n4.4 Optimization details\nThe parameters were initialized by sampling randomly from N (0, 0.001\u00b2I), except for the bias parameters which were initialized as 0. The objectives were optimized using minibatch gradient ascent until convergence, using a variant of RMSProp with momentum and initialization bias correction, a constant learning rate of 0.0003, first moment decay (momentum) of 0.1, and second moment decay of 0.001. For MNIST experiments, minibatches for training were generated by treating normalised pixel intensities of the images as Bernoulli probabilities and sampling binary images from this distribution. In the M2 model, a weight decay was used corresponding to a prior of (\u03b8, \u03c6) \u223c N (0, I).\n\n5 Discussion and Conclusion\nThe approximate inference methods introduced here can be easily extended to the model\u2019s parameters, harnessing the full power of variational learning. Such an extension also provides a principled ground for performing model selection. Efficient model selection is particularly important when the amount of available data is not large, such as in semi-supervised learning.\nFor image classification tasks, one area of interest is to combine such methods with convolutional neural networks that form the gold-standard for current supervised classification methods. Since all the components of our model are parametrised by neural networks we can readily exploit convolutional or more general locally-connected architectures \u2013 a promising avenue for future exploration.\nA limitation of the models we have presented is that they scale linearly in the number of classes in the data sets. Having to re-evaluate the generative likelihood for each class during training is an expensive operation. 
Potential reduction of the number of evaluations could be achieved by using a\ntruncation of the posterior mass. For instance we could combine our method with the truncation al-\ngorithm suggested by Pal et al. (2005), or by using mechanisms such as error-correcting output codes\n(Dietterich and Bakiri, 1995). The extension of our model to multi-label classi\ufb01cation problems that\nis essential for image-tagging is also possible, but requires similar approximations to reduce the\nnumber of likelihood-evaluations per class.\nWe have developed new models for semi-supervised learning that allow us to improve the quality of\nprediction by exploiting information in the data density using generative models. We have developed\nan ef\ufb01cient variational optimisation algorithm for approximate Bayesian inference in these models\nand demonstrated that they are amongst the most competitive models currently available for semi-\nsupervised learning. We hope that these results stimulate the development of even more powerful\nsemi-supervised classi\ufb01cation methods based on generative models, of which there remains much\nscope.\n\nAcknowledgements. We are grateful for feedback from the reviewers. We would also like to\nthank the SURFFoundation for the use of the Dutch national e-infrastructure for a signi\ufb01cant part of\nthe experiments.\n\n8\n\n\fReferences\nAdams, R. P. and Ghahramani, Z. (2009). Archipelago: nonparametric Bayesian semi-supervised learning. In\n\nProceedings of the International Conference on Machine Learning (ICML).\n\nBlum, A., Lafferty, J., Rwebangira, M. R., and Reddy, R. (2004). Semi-supervised learning using randomized\n\nmincuts. In Proceedings of the International Conference on Machine Learning (ICML).\n\nDayan, P. (2000). Helmholtz machines and wake-sleep learning. Handbook of Brain Theory and Neural\n\nNetwork. MIT Press, Cambridge, MA, 44(0).\n\nDietterich, T. G. and Bakiri, G. (1995). 
Solving multiclass learning problems via error-correcting output codes. arXiv preprint cs/9501101.
Duchi, J., Hazan, E., and Singer, Y. (2010). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
Fergus, R., Weiss, Y., and Torralba, A. (2009). Semi-supervised learning in gigantic image collections. In Advances in Neural Information Processing Systems (NIPS).
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning (ICML), volume 99, pages 200–209.
Kemp, C., Griffiths, T. L., Stromsten, S., and Tenenbaum, J. B. (2003). Semi-supervised learning with trees. In Advances in Neural Information Processing Systems (NIPS).
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
Li, P., Ying, Y., and Campbell, C. (2009). A variational approach to semi-supervised clustering. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN), pages 11–16.
Liang, P. (2005). Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology.
Liu, Y. and Kirchhoff, K. (2013). Graph-based semi-supervised learning for phone and segment classification. In Proceedings of Interspeech.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Pal, C., Sutton, C., and McCallum, A. (2005). Fast inference and learning with sparse belief propagation. In Advances in Neural Information Processing Systems (NIPS).
Pitelis, N., Russell, C., and Agapito, L. (2014).
Semi-supervised learning using an unsupervised atlas. In Proceedings of the European Conference on Machine Learning (ECML), volume LNCS 8725, pages 565–580.
Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 792–799.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning (ICML), volume 32 of JMLR W&CP.
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011). The manifold tangent classifier. In Advances in Neural Information Processing Systems (NIPS), pages 2294–2302.
Rosenberg, C., Hebert, M., and Schneiderman, H. (2005). Semi-supervised self-training of object detection models. In Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION'05).
Shi, M. and Zhang, B. (2011). Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics, 27(21):3017–3023.
Stuhlmüller, A., Taylor, J., and Goodman, N. (2013). Learning stochastic inverses. In Advances in Neural Information Processing Systems (NIPS), pages 3048–3056.
Tang, Y. and Salakhutdinov, R. (2013). Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 530–538.
Wang, Y., Haffari, G., Wang, S., and Mori, G. (2009). A rate distortion approach for semi-supervised conditional random fields. In Advances in Neural Information Processing Systems (NIPS), pages 2008–2016.
Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012). Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer.
Zhu, X. (2006).
Semi-supervised learning literature survey. Technical report, Computer Science, University of Wisconsin-Madison.
Zhu, X., Ghahramani, Z., Lafferty, J., et al. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the International Conference on Machine Learning (ICML), volume 3, pages 912–919.