{"title": "Learning with Pseudo-Ensembles", "book": "Advances in Neural Information Processing Systems", "page_first": 3365, "page_last": 3373, "abstract": "We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We examine the relationship of pseudo-ensembles, which involve perturbation in model-space, to standard ensemble methods and existing notions of robustness, which focus on perturbation in observation-space. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.", "full_text": "Learning with Pseudo-Ensembles\n\nPhilip Bachman\nMcGill University\n\nMontreal, QC, Canada\n\nphil.bachman@gmail.com\n\nOuais Alsharif\nMcGill University\n\nMontreal, QC, Canada\n\nouais.alsharif@gmail.com\n\nDoina Precup\n\nMcGill University\n\nMontreal, QC, Canada\n\ndprecup@cs.mcgill.ca\n\nAbstract\n\nWe formalize the notion of a pseudo-ensemble, a (possibly in\ufb01nite) collection\nof child models spawned from a parent model by perturbing it according to some\nnoise process. 
E.g., dropout [9] in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We examine the relationship of pseudo-ensembles, which involve perturbation in model-space, to standard ensemble methods and existing notions of robustness, which focus on perturbation in observation-space. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of [19] into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\n\n1 Introduction\n\nEnsembles of models have long been used as a way to obtain robust performance in the presence of noise. Ensembles typically work by training several classifiers on perturbed input distributions, e.g., bagging randomly elides parts of the distribution for each trained model and boosting re-weights the distribution before training and adding each model to the ensemble. In the last few years, dropout methods have achieved great empirical success in training deep models, by leveraging a noise process that perturbs the model structure itself. However, there has not yet been much analysis relating this approach to classic ensemble methods or other approaches to learning robust models.\n\nIn this paper, we formalize the notion of a pseudo-ensemble, which is a collection of child models spawned from a parent model by perturbing it with some noise process. Sec. 2 defines pseudo-ensembles, after which Sec.
3 discusses the relationships between pseudo-ensembles and standard ensemble methods, as well as existing notions of robustness. Once the pseudo-ensemble framework is defined, it can be leveraged to create new algorithms. In Sec. 4, we develop a novel regularizer that minimizes variation in the output of a model when it is subject to noise on its inputs and its internal state (or structure). We also discuss the relationship of this regularizer to standard dropout methods. In Sec. 5 we show that our regularizer can reproduce the performance of dropout in a fully-supervised setting, while also naturally extending to the semi-supervised setting, where it produces state-of-the-art performance on some real-world datasets. Sec. 6 presents a case study in which we extend the Recursive Neural Tensor Network from [19] by converting it into a pseudo-ensemble. We generate the pseudo-ensemble using a noise process based on Gaussian parameter fuzzing and latent subspace sampling, and empirically show that both types of perturbation contribute to significant performance improvements beyond that of the original model. We conclude in Sec. 7.\n\n2 What is a pseudo-ensemble?\n\nConsider a data distribution p_xy which we want to approximate using a parametric parent model f_θ. A pseudo-ensemble is a collection of ξ-perturbed child models f_θ(x; ξ), where ξ comes from a noise process p_ξ. Dropout [9] provides the clearest existing example of a pseudo-ensemble. Dropout samples subnetworks from a source network by randomly masking the activity of subsets of its input/hidden layer nodes. The parameters shared by the subnetworks, through their common source network, are learned to minimize the expected loss of the individual subnetworks.
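The dropout construction just described can be sketched numerically. This is a minimal illustration under assumed details (the layer sizes, ReLU activations, squared-error loss, and all names here are invented for the sketch, not taken from the paper's code): a child model is the parent network with a sampled node mask ξ applied, and training averages the loss over sampled children.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parent model: a tiny two-layer network (sizes are illustrative).
W1 = rng.normal(scale=0.1, size=(784, 64))
W2 = rng.normal(scale=0.1, size=(64, 10))

def sample_xi(p_drop=0.5):
    """Noise process p_xi: independent node masking, with inverted-dropout
    rescaling so child outputs match the parent in expectation."""
    return (rng.random(64) >= p_drop).astype(float) / (1.0 - p_drop)

def child_forward(x, xi):
    """Child model f_theta(x; xi): the parent forward pass with hidden
    activities masked by xi, i.e. one sampled subnetwork."""
    h = np.maximum(0.0, x @ W1) * xi  # mask the hidden layer
    return h @ W2

# Monte Carlo estimate of the pseudo-ensemble training loss for one example:
# the expectation over xi of the individual child losses (squared error here).
x = rng.normal(size=(784,))
y = np.zeros(10)
y[3] = 1.0
losses = [np.mean((child_forward(x, sample_xi()) - y) ** 2) for _ in range(8)]
expected_loss = np.mean(losses)
```

The shared weights W1/W2 play the role of the common source network: every sampled child reuses them, which is what couples the members of the pseudo-ensemble.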
In pseudo-ensemble terms, the source network is the parent model, each sampled subnetwork is a child model, and the noise process consists of sampling a node mask and using it to extract a subnetwork.\n\nThe noise process used to generate a pseudo-ensemble can take fairly arbitrary forms. The only requirement is that sampling a noise realization ξ, and then imposing it on the parent model f_θ, be computationally tractable. This generality allows deriving a variety of pseudo-ensemble methods from existing models. For example, for a Gaussian Mixture Model, one could perturb the means of the mixture components with, e.g., Gaussian noise and their covariances with, e.g., Wishart noise.\n\nThe goal of learning with pseudo-ensembles is to produce models robust to perturbation. To formalize this, the general pseudo-ensemble objective for supervised learning can be written as follows¹:\n\nminimize_θ E_{(x,y)∼p_xy} E_{ξ∼p_ξ} [ L(f_θ(x; ξ), y) ],   (1)\n\nwhere (x, y) ∼ p_xy is an (observation, label) pair drawn from the data distribution, ξ ∼ p_ξ is a noise realization, f_θ(x; ξ) represents the output of a child model spawned from the parent model f_θ via ξ-perturbation, y is the true label for x, and L(ŷ, y) is the loss for predicting ŷ instead of y.\n\nThe generality of the pseudo-ensemble approach comes from broad freedom in describing the noise process p_ξ and the mechanism by which ξ perturbs the parent model f_θ. Many useful methods could be developed by exploring novel noise processes for generating perturbations beyond the independent masking noise that has been considered for neural networks and the feature noise that has been considered in the context of linear models. For example, [17] develops a method for learning "ordered representations" by applying dropout/masking noise in a deep autoencoder while enforcing a particular "nested" structure among the random masking variables in ξ, and [2] relies heavily on random perturbations when training Generative Stochastic Networks.\n\n3 Related work\n\nPseudo-ensembles are closely related to traditional ensemble methods as well as to methods for learning models robust to input uncertainty. By optimizing the expected loss of individual ensemble members' outputs, rather than the expected loss of the joint ensemble output, pseudo-ensembles differ from boosting, which iteratively augments an ensemble to minimize the loss of the joint output [8]. Meanwhile, the child models in a pseudo-ensemble share parameters and structure through their parent model, which will tend to correlate their behavior. This distinguishes pseudo-ensembles from traditional "independent member" ensemble methods, like bagging and random forests, which typically prefer diversity in the behavior of their members, as this provides bias and variance reduction when the outputs of their members are averaged [8]. In fact, the regularizers we introduce in Sec. 4 explicitly minimize diversity in the behavior of their pseudo-ensemble members.\n\nThe definition and use of pseudo-ensembles are strongly motivated by the intuition that models trained to be robust to noise should generalize better than models that are (overly) sensitive to small perturbations. Previous work on robust learning has overwhelmingly concentrated on perturbations affecting the inputs to a model. For example, the optimization community has produced a large body of theoretical and empirical work addressing "stochastic programming" [18] and "robust optimization" [4].
Stochastic programming seeks to produce a solution to a, e.g., linear program that performs well on average, with respect to a known distribution over perturbations of parameters in the problem definition². Robust optimization generally seeks to produce a solution to a, e.g., linear program with optimal worst case performance over a given set of possible perturbations of parameters in the problem definition. Several well-known machine learning methods have been shown equivalent to certain robust optimization problems. For example, [24] shows that using Lasso (i.e. ℓ1 regularization) in a linear regression model is equivalent to a robust optimization problem. [25] shows that learning a standard SVM (i.e. hinge loss with ℓ2 regularization in the corresponding RKHS) is also equivalent to a robust optimization problem. Supporting the notion that noise-robustness improves generalization, [25] proves many of the statistical guarantees that make SVMs so appealing directly from properties of their robust optimization equivalents, rather than using more complicated proofs involving, e.g., VC-dimension.\n\n¹ It is easy to formulate analogous objectives for unsupervised learning, maximum likelihood, etc.\n\nMore closely related to pseudo-ensembles are recent works that consider approaches to learning linear models with inputs perturbed by different sorts of noise. [5] shows how to efficiently learn a linear model that (globally) optimizes expected performance w.r.t. certain types of noise (e.g. Gaussian, zero-masking, Poisson) on its inputs, by marginalizing over the noise. Particularly relevant to our work is [21], which studies dropout (applied to linear models) closely, and shows how its effects are well-approximated by a Tikhonov (i.e. quadratic/ridge) regularization term that can be estimated from both labeled and unlabeled data.
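The linear-model analyses above, like the derivation in Sec. 4.2, rest on a noise process that is unbiased in expectation, E_ξ[f_θ(x; ξ)] = f_θ(x). A quick numeric check of that property for rescaled ("inverted") masking noise on a linear model; the sizes, seed, and names here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=(20,))
x = rng.normal(size=(20,))
p_drop = 0.5

def f(x, mask=None):
    """Linear model; `mask` plays the role of the noise realization xi.
    The 1/(1 - p_drop) rescaling makes the child unbiased in expectation."""
    if mask is None:
        return theta @ x  # unperturbed parent model
    return theta @ (x * mask) / (1.0 - p_drop)  # masked child model

# Average many child outputs; the mean should recover the parent output,
# i.e. E_xi[f(x; xi)] = f(x) for this noise process.
masks = (rng.random((200_000, 20)) >= p_drop).astype(float)
child_outputs = (masks * x) @ theta / (1.0 - p_drop)
mc_estimate = child_outputs.mean()
```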
The authors of [21] leveraged this label-agnosticism to achieve state-of-the-art performance on several sentiment analysis tasks.\n\nWhile all the work described above considers noise on the input-space, pseudo-ensembles involve noise in the model-space. This can actually be seen as a superset of input-space noise, as a model can always be extended with an initial "identity layer" that copies the noise-free input. Noise on the input-space can then be reproduced by noise on the initial layer, which is now part of the model-space.\n\nFigure 1: How to compute the partial noisy output f^i_θ: (1) compute the ξ-perturbed output f̃^{i-1}_θ of layers < i, (2) compute f^i_θ from f̃^{i-1}_θ, (3) ξ-perturb f^i_θ to get f̃^i_θ, (4) repeat up through the layers > i.\n\n4 The Pseudo-Ensemble Agreement regularizer\n\nWe now present Pseudo-Ensemble Agreement (PEA) regularization, which can be used in a fairly general class of computation graphs. For concreteness, we present it in the case of deep, layered neural networks. PEA regularization operates by controlling distributional properties of the random vectors {f^2_θ(x; ξ), ..., f^d_θ(x; ξ)}, where f^i_θ(x; ξ) gives the activities of the ith layer of f_θ in response to x when layers < i are perturbed by ξ while layer i is left unperturbed. Fig. 1 illustrates the construction of these random vectors. We will assume that layer d is the output layer, i.e. f^d_θ(x) gives the output of the unperturbed parent model in response to x and f^d_θ(x; ξ) = f_θ(x; ξ) gives the response of the child model generated by ξ-perturbing f_θ.\n\nGiven the random vectors f^i_θ(x; ξ), PEA regularization is defined as follows:\n\nR(f_θ, p_x, p_ξ) = E_{x∼p_x} E_{ξ∼p_ξ} [ Σ_{i=2}^{d} λ_i V_i(f^i_θ(x), f^i_θ(x; ξ)) ],   (2)\n\nwhere f_θ is the parent model to regularize, x ∼ p_x is an unlabeled observation, V_i(·,·) is the "variance" penalty imposed on the distribution of activities in the ith layer of the pseudo-ensemble spawned from f_θ, and λ_i controls the relative importance of V_i. Note that for Eq. 2 to act on the "variance" of the f^i_θ(x; ξ), we should have f^i_θ(x) ≈ E_ξ f^i_θ(x; ξ). This approximation holds reasonably well for many useful neural network architectures [1, 22]. In our experiments we actually compute the penalties V_i between independently-sampled pairs of child models. We consider several different measures of variance to penalize, which we will introduce as needed.\n\n² Note that "parameters" in a linear program are analogous to inputs in standard machine learning terminology, as they are observed quantities (rather than quantities optimized over).\n\n4.1 The effect of PEA regularization on feature co-adaptation\n\nOne of the original motivations for dropout was that it helps prevent "feature co-adaptation" [9]. That is, dropout encourages individual features (i.e. hidden node activities) to remain helpful, or at least not become harmful, when other features are removed from their local context. We provide some support for that claim by examining the following optimization objective³:\n\nminimize_θ E_{(x,y)∼p_xy} [ L(f_θ(x), y) ] + E_{x∼p_x} E_{ξ∼p_ξ} [ Σ_{i=2}^{d} λ_i V_i(f^i_θ(x), f^i_θ(x; ξ)) ],   (3)\n\nin which the supervised loss L depends only on the parent model f_θ and the pseudo-ensemble only appears in the PEA regularization term. For simplicity, let λ_i = 0 for i < d, λ_d = 1, and V_d(v1, v2) = D_KL(softmax(v1) || softmax(v2)), where softmax is the standard softmax and D_KL(p1 || p2) is the KL-divergence between p1 and p2 (we indicate this penalty by V^k). We use xent(softmax(f_θ(x)), y) for the loss L(f_θ(x), y), where xent(ŷ, y) is the cross-entropy between the predicted distribution ŷ and the true distribution y. Eq. 3 never explicitly passes label information through a ξ-perturbed network, so ξ only acts through its effects on the distribution of the parent model's predictions when subjected to ξ-perturbation. In this case, (3) trades off accuracy against feature co-adaptation, as measured by the degree to which the feature activity distribution at layer i is affected by perturbation of the feature activity distributions for layers < i.\n\nWe test this regularizer empirically in Sec. 5.1. The observed ability of this regularizer to reproduce the performance benefits of standard dropout supports the notion that discouraging "co-adaptation" plays an important role in dropout's empirical success.
Also, by acting strictly to make the output of the parent model more robust to ξ-perturbation, the performance of this regularizer rebuts the claim in [22] that noise-robustness plays only a minor role in the success of standard dropout.\n\n³ While dropout is well-supported empirically, its mode-of-action is not well-understood outside the limited context of linear models.\n\n4.2 Relating PEA regularization to standard dropout\n\nThe authors of [21] show that, assuming a noise process ξ such that E_ξ[f(x; ξ)] = f(x), logistic regression under the influence of dropout optimizes the following objective:\n\nΣ_{i=1}^{n} E_ξ[ℓ(f_θ(x_i; ξ), y_i)] = Σ_{i=1}^{n} ℓ(f_θ(x_i), y_i) + R(f_θ),   (4)\n\nwhere f_θ(x_i) = θx_i, ℓ(f_θ(x_i), y_i) is the logistic regression loss, and the regularization term is:\n\nR(f_θ) ≡ Σ_{i=1}^{n} E_ξ[A(f_θ(x_i; ξ)) − A(f_θ(x_i))],   (5)\n\nwhere A(·) indicates the log partition function for logistic regression.\n\nUsing only a KL-d penalty at the output layer, PEA-regularized logistic regression minimizes:\n\nΣ_{i=1}^{n} ℓ(f_θ(x_i), y_i) + E_ξ[D_KL(softmax(f_θ(x_i)) || softmax(f_θ(x_i; ξ)))].   (6)\n\nDefining the distribution p_θ(x) as softmax(f_θ(x)), we can re-write the PEA part of Eq. 6 to get:\n\nE_ξ[D_KL(p_θ(x) || p_θ(x; ξ))] = E_ξ[ Σ_{c∈C} p^c_θ(x) log (p^c_θ(x) / p^c_θ(x; ξ)) ]   (7)\n\n= Σ_{c∈C} E_ξ[ p^c_θ(x) log ( (exp f^c_θ(x) Σ_{c'∈C} exp f^{c'}_θ(x; ξ)) / (exp f^c_θ(x; ξ) Σ_{c'∈C} exp f^{c'}_θ(x)) ) ]   (8)\n\n= Σ_{c∈C} E_ξ[ p^c_θ(x)(f^c_θ(x) − f^c_θ(x; ξ)) + p^c_θ(x)(A(f_θ(x; ξ)) − A(f_θ(x))) ]   (9)\n\n= E_ξ[ Σ_{c∈C} p^c_θ(x)(A(f_θ(x; ξ)) − A(f_θ(x))) ] = E_ξ[A(f_θ(x; ξ)) − A(f_θ(x))],   (10)\n\nwhere the first term in (9) vanishes in expectation because E_ξ[f_θ(x; ξ)] = f_θ(x), and the final equality uses Σ_{c∈C} p^c_θ(x) = 1, which brings us to the regularizer in Eq. 5.\n\n4.3 PEA regularization for semi-supervised learning\n\nPEA regularization works as-is in a semi-supervised setting, as the penalties V_i do not require label information. We train networks for semi-supervised learning in two ways, both of which apply the objective in Eq. 1 on labeled examples and PEA regularization on the unlabeled examples. The first way applies a tanh-variance penalty V^t and the second way applies a xent-variance penalty V^x, which we define as follows:\n\nV^t(ȳ, ỹ) = ‖tanh(ȳ) − tanh(ỹ)‖₂²,   V^x(ȳ, ỹ) = xent(softmax(ȳ), softmax(ỹ)),   (11)\n\nwhere ȳ and ỹ represent the outputs of a pair of independently sampled child models, and tanh operates element-wise. The xent-variance penalty can be further expanded as:\n\nV^x(ȳ, ỹ) = D_KL(softmax(ȳ) || softmax(ỹ)) + ent(softmax(ȳ)),   (12)\n\nwhere ent(·) denotes the entropy.
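The expansion in Eq. 12 is the standard identity xent(p, q) = ent(p) + D_KL(p || q). A small numeric sanity check of both penalties; the helper names and sizes are invented for illustration:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def xent(p, q):
    """Cross-entropy between distributions p and q."""
    return -np.sum(p * np.log(q))

def ent(p):
    """Entropy of distribution p."""
    return -np.sum(p * np.log(p))

def kl(p, q):
    """KL-divergence D_KL(p || q)."""
    return np.sum(p * np.log(p / q))

def V_t(y_bar, y_til):
    """tanh-variance penalty, Eq. (11)."""
    return np.sum((np.tanh(y_bar) - np.tanh(y_til)) ** 2)

def V_x(y_bar, y_til):
    """xent-variance penalty, Eq. (11)."""
    return xent(softmax(y_bar), softmax(y_til))

# Outputs of two hypothetical independently sampled child models.
rng = np.random.default_rng(2)
y_bar, y_til = rng.normal(size=10), rng.normal(size=10)
```

With these definitions, V_x(y_bar, y_til) equals kl(softmax(y_bar), softmax(y_til)) plus ent(softmax(y_bar)), which is exactly the decomposition in Eq. 12.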
Thus, V^x combines the KL-divergence penalty with an entropy penalty, which has been shown to perform well in a semi-supervised setting [7, 14]. Recall that at non-output layers we regularize with the "direction" penalty V^c. Before the masking noise, we also apply zero-mean Gaussian noise to the input and to the biases of all nodes. In the experiments, we chose between the two output-layer penalties V^t/V^x based on observed performance.\n\n5 Testing PEA regularization\n\nWe tested PEA regularization in three scenarios: supervised learning on MNIST digits, semi-supervised learning on MNIST digits, and semi-supervised transfer learning on a dataset from the NIPS 2011 Workshop on Challenges in Learning Hierarchical Models [13]. Full implementations of our methods, written with THEANO [3], and scripts/instructions for reproducing all of the results in this section are available online at: http://github.com/Philip-Bachman/Pseudo-Ensembles.\n\n5.1 Fully-supervised MNIST\n\nThe MNIST dataset comprises 60k 28x28 grayscale hand-written digit images for training and 10k images for testing. For the supervised tests we used SGD hyperparameters roughly following those in [9]. We trained networks with two hidden layers of 800 nodes each, using rectified-linear activations and an ℓ2-norm constraint of 3.5 on incoming weights for each node. For both standard dropout (SDE) and PEA, we used softmax → xent loss at the output layer. We initialized hidden layer biases to 0.1, output layer biases to 0, and inter-layer weights to zero-mean Gaussian noise with σ = 0.01. We trained all networks for 1000 epochs with no early-stopping (i.e. performance was measured for the final network state).\n\nSDE obtained 1.05% error averaged over five random initializations.
Using PEA penalty V^k at the output layer and computing classification loss/gradient only for the unperturbed parent network, we obtained 1.08% averaged error. The ξ-perturbation involved node masking but not bias noise. Thus, training the same network as used for dropout while ignoring the effects of masking noise on the classification loss, but encouraging the network to be robust to masking noise (as measured by V^k), matched the performance of dropout. This result supports the equivalence between dropout and this particular form of PEA regularization, which we derived in Section 4.2.\n\n5.2 Semi-supervised MNIST\n\nWe tested semi-supervised learning on MNIST following the protocol described in [23]. These tests split MNIST's 60k training samples into labeled/unlabeled subsets, with the labeled sets containing n_l ∈ {100, 600, 1000, 3000} samples. For labeled sets of size 600, 1000, and 3000, the full training data was randomly split 10 times into labeled/unlabeled sets and results were averaged over the splits. For labeled sets of size 100, we averaged over 50 random splits. The labeled sets had the same number of examples for each class. We tested PEA regularization with and without denoising autoencoder pre-training [20]⁴. Pre-trained networks were always PEA-regularized with penalty V^x on the output layer and V^c on the hidden layers. Non-pre-trained networks used V^t on the output layer, except when the labeled set was of size 100, for which V^x was used. In the latter case, we gradually increased the λ_i over the course of training, as suggested by [7]. We generated the pseudo-ensembles for these tests using masking noise and Gaussian input+bias noise with σ = 0.1. Each network had two hidden layers with 800 nodes. Weight norm constraints and SGD hyperparameters were set as for supervised learning.\n\n⁴ See our code for a complete description of our pre-training.\n\nFigure 2: Performance of PEA regularization for semi-supervised learning using the MNIST dataset. The top row of filter blocks in (a) were the result of training a fixed network architecture on 600 labeled samples using: weight norm constraints only (RAW), standard dropout (SDE), standard dropout with PEA regularization on unlabeled data (PEA), and PEA preceded by pre-training as a denoising autoencoder [20] (PEA+PT). The bottom filter block in (a) was the result of training with PEA on 100 labeled samples. (b) shows test error over the course of training for RAW/SDE/PEA, averaged over 10 random training sets of size 600/1000.\n\nTable 1 compares the performance of PEA regularization with previous results. Aside from CNN, all methods in the table are "general", i.e. do not use convolutions or other image-specific techniques to improve performance. The main comparisons of interest are between PEA(+) and other methods for semi-supervised learning with neural networks, i.e. E-NN, MTC+, and PL+. E-NN (EmbedNN from [23]) uses a nearest-neighbors-based graph Laplacian regularizer to make predictions "smooth" with respect to the manifold underlying the data distribution p_x. MTC+ (the Manifold Tangent Classifier from [16]) regularizes predictions to be smooth with respect to the data manifold by penalizing gradients in a learned approximation of the tangent space of the data manifold. PL+ (the Pseudo-Label method from [14]) uses the joint-ensemble predictions on unlabeled data as "pseudo-labels", and treats them like "true" labels. The classification losses on true labels and pseudo-labels are balanced by a scaling factor which is carefully modulated over the course of training. PEA regularization (without pre-training) outperforms all previous methods in every setting except 100 labeled samples, where PL+ performs better, but with the benefit of pre-training.
By adding pre-training (i.e. PEA+), we achieve a two-fold reduction in error when using only 100 labeled samples.\n\nn_l   | TSVM  | NN    | CNN   | E-NN  | MTC+  | PL+   | SDE   | SDE+  | PEA   | PEA+\n100   | 16.81 | 25.81 | 22.98 | 16.86 | 12.03 | 10.49 | 22.89 | 13.54 | 10.79 | 5.21\n600   | 6.16  | 11.44 | 7.68  | 5.97  | 5.13  | 4.01  | 7.59  | 5.68  | 2.87  | 2.44\n1000  | 5.38  | 10.70 | 6.45  | 5.73  | 3.64  | 3.46  | 5.80  | 4.71  | 2.64  | 2.23\n3000  | 3.45  | 6.04  | 3.35  | 3.59  | 2.57  | 2.69  | 3.60  | 3.00  | 2.30  | 1.91\n\nTable 1: Performance of semi-supervised learning methods on MNIST with varying numbers of labeled samples. From left-to-right the methods are Transductive SVM, neural net, convolutional neural net, EmbedNN [23], Manifold Tangent Classifier [16], Pseudo-Label [14], standard dropout plus fuzzing [9], dropout plus fuzzing with pre-training, PEA, and PEA with pre-training. Methods with a "+" used contractive or denoising autoencoder pre-training [20]. The testing protocol and the results left of MTC+ were presented in [23]. The MTC+ and PL+ results are from their respective papers and the remaining results are our own. We trained SDE(+) using the same network/SGD hyperparameters as for PEA. The only difference was that the former did not regularize for pseudo-ensemble agreement on the unlabeled examples. We measured performance on the standard 10k test samples for MNIST, and all of the 60k training samples not included in a given labeled training set were made available without labels. The best result for each training size is in bold.\n\n5.3 Transfer learning challenge (NIPS 2011)\n\nThe organizers of the NIPS 2011 Workshop on Challenges in Learning Hierarchical Models [13] proposed a challenge to improve performance on a target domain by using labeled and unlabeled data from two related source domains. The labeled data source was CIFAR-100 [11], which contains 50k 32x32 color images in 100 classes.
The unlabeled data source was a collection of 100k 32x32 color images taken from Tiny Images [11]. The target domain comprised 120 32x32 color images divided unevenly among 10 classes. Neither the classes nor the images in the target domain appeared in either of the source domains. The winner of this challenge used convolutional Spike and Slab Sparse Coding, followed by max pooling and a linear SVM on the pooled features [6]. Labels on the source data were ignored and the source data was used to pre-train a large set of convolutional features. After applying the pre-trained feature extractor to the 120 training images, this method achieved an accuracy of 48.6% on the target domain, the best published result on this dataset.\n\nWe applied semi-supervised PEA regularization by first using the CIFAR-100 data to train a deep network comprising three max-pooled convolutional layers followed by a fully-connected hidden layer which fed into a softmax → xent output layer. Afterwards, we removed the hidden and output layers, replaced them with a pair of fully-connected hidden layers feeding into an ℓ2-hinge-loss output layer⁵, and then trained the non-convolutional part of the network on the 120 training images from the target domain. For this final training phase, which involved three layers, we tried standard dropout and dropout with PEA regularization on the source data. Standard dropout achieved 55.5% accuracy, which improved to 57.4% when we added PEA regularization on the source data. While most of the improvement over the previous state-of-the-art (i.e. 48.6%) was due to dropout and an improved training strategy (i.e. supervised pre-training vs.
unsupervised pre-training), controlling the feature activity and output distributions of the pseudo-ensemble on unlabeled data allowed significant further improvement.\n\n6 Improved sentiment analysis using pseudo-ensembles\n\nWe now show how the Recursive Neural Tensor Network (RNTN) from [19] can be adapted using pseudo-ensembles, and evaluate it on the Stanford Sentiment Treebank (STB) task. The STB task involves predicting the sentiment of short phrases extracted from movie reviews on RottenTomatoes.com. Ground-truth labels for the phrases, and the "sub-phrases" produced by processing them with a standard parser, were generated using Amazon Mechanical Turk. In addition to pseudo-ensembles, we used a more "compact" bilinear form in the function f : R^n × R^n → R^n that the RNTN applies recursively as shown in Figure 3. The computation for the ith dimension of the original f (for v_i ∈ R^{n×1}) is:\n\nf_i(v1, v2) = tanh([v1; v2]^⊤ T_i [v1; v2] + M_i [v1; v2; 1]), whereas we use:\n\nf_i(v1, v2) = tanh(v1^⊤ T_i v2 + M_i [v1; v2; 1]),\n\nin which T_i indicates a matrix slice of tensor T and M_i indicates a vector row of matrix M. In the original RNTN, T is 2n × 2n × n and in ours it is n × n × n. The other parameters in the RNTNs are a transform matrix M ∈ R^{n×(2n+1)} and a classification matrix C ∈ R^{c×(n+1)}; each RNTN outputs c class probabilities for vector v using softmax(C[v; 1]). A ";" indicates vertical vector stacking.\n\nWe initialized the model with pre-trained word vectors. The pre-training used word2vec on the training and dev set, with three modifications: dropout/fuzzing was applied during pre-training (to match the conditions in the full model), the vector norms were constrained so the pre-trained vectors had standard deviation 0.5, and tanh was applied during word2vec (again, to match conditions in the full model).
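The compact composition function defined above can be sketched as follows. This is an illustrative reimplementation under assumed details (the dimension n, class count c, and random initialization are invented here, and this is not the released code): each output dimension k is the bilinear form v1^⊤ T_k v2 plus the affine term M_k [v1; v2; 1], passed through tanh.

```python
import numpy as np

rng = np.random.default_rng(3)
n, c = 8, 5  # latent dimension and class count (illustrative sizes)
T = rng.normal(scale=0.1, size=(n, n, n))       # compact tensor, n x n x n
M = rng.normal(scale=0.1, size=(n, 2 * n + 1))  # transform matrix
C = rng.normal(scale=0.1, size=(c, n + 1))      # classification matrix

def compose(v1, v2):
    """Compact composition f(v1, v2): for each dimension k, the bilinear
    form v1^T T_k v2 plus the affine term M_k [v1; v2; 1], through tanh."""
    stacked = np.concatenate([v1, v2, [1.0]])      # [v1; v2; 1]
    bilinear = np.einsum('i,kij,j->k', v1, T, v2)  # all n slices at once
    return np.tanh(bilinear + M @ stacked)

def classify(v):
    """Class probabilities softmax(C [v; 1]) for a node vector v."""
    z = C @ np.concatenate([v, [1.0]])
    e = np.exp(z - z.max())
    return e / e.sum()

# Compose two hypothetical word vectors and classify the result, as the
# RNTN does recursively at each internal node of a phrase tree.
w1, w2 = rng.normal(size=n), rng.normal(size=n)
p = classify(compose(w1, w2))
```

The einsum contraction is where the compact form saves work relative to the original: T has n³ entries rather than 2n·2n·n.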
All code required for these experiments is publicly available online.\n\nWe generated pseudo-ensembles from a parent RNTN using two types of perturbation: subspace sampling and weight fuzzing. We performed subspace sampling by keeping only n/2 randomly sampled latent dimensions out of the n in the parent model when processing a given phrase tree. Using the same sampled dimensions for a full phrase tree reduced computation time significantly, as the parameter matrices/tensor could be "sliced" to include only the relevant dimensions⁶. During training we sampled a new subspace each time a phrase tree was processed and computed test-time outputs for each phrase tree by averaging over 50 randomly sampled subspaces. We performed weight fuzzing during training by perturbing parameters with zero-mean Gaussian noise before processing each phrase tree and then applying gradients w.r.t. the perturbed parameters to the unperturbed parameters. We did not fuzz during testing. Weight fuzzing has an interesting interpretation as an implicit convolution of the objective function (defined w.r.t. the model parameters) with an isotropic Gaussian distribution. In the case of recursive/recurrent neural networks this may prove quite useful, as convolving the objective with a Gaussian reduces its curvature, thereby mitigating some problems stemming from ill-conditioned Hessians [15].\n\n⁵ We found that ℓ2-hinge-loss performed better than softmax → xent in this setting. Switching to softmax → xent degrades the dropout and PEA results but does not change their ranking.\n\n⁶ This allowed us to train significantly larger models before over-fitting offset increased model capacity. But, training these larger models would have been tedious without the parameter slicing permitted by subspace sampling, as feedforward for the RNTN is O(n³).
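The two perturbations just described can be sketched as follows. This is a hypothetical illustration of the slicing and fuzzing arithmetic only, not the released training code; the names, sizes, and noise scale are invented:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
T = rng.normal(scale=0.1, size=(n, n, n))       # compact tensor
M = rng.normal(scale=0.1, size=(n, 2 * n + 1))  # transform matrix

def sample_subspace():
    """Subspace sampling: keep n/2 of the n latent dimensions for a whole
    phrase tree, slicing the parameters down to only those dimensions."""
    keep = np.sort(rng.choice(n, size=n // 2, replace=False))
    # Rows of the stacked vector [v1; v2; 1] that survive the slicing.
    cols = np.concatenate([keep, keep + n, [2 * n]])
    T_sub = T[np.ix_(keep, keep, keep)]
    M_sub = M[np.ix_(keep, cols)]
    return keep, T_sub, M_sub

def fuzz(params, sigma=0.01):
    """Weight fuzzing: zero-mean Gaussian noise added to the parameters
    before a tree is processed; gradients taken w.r.t. the fuzzed copies
    would then be applied to the clean parameters."""
    return [p + rng.normal(scale=sigma, size=p.shape) for p in params]

keep, T_sub, M_sub = sample_subspace()
T_fuzzed, M_fuzzed = fuzz([T, M])
```

Slicing shrinks the per-tree feedforward cost from O(n³) to O((n/2)³), which is what makes the larger models mentioned in the footnote practical to train.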
For further description of the model and training/testing process, see the supplementary material and the code from http://github.com/Philip-Bachman/Pseudo-Ensembles.

               RNTN    PV  DCNN   CTN  CTN+F  CTN+S  CTN+F+S
Fine-grained   45.7  48.7  48.5  43.1   46.1   47.5     48.4
Binary         85.4  87.8  86.8  83.4   85.3   87.8     88.9

Table 2: Fine-grained and binary root-level prediction performance for the Stanford Sentiment Treebank task. RNTN is the original "full" model presented in [19]. CTN is our "compact" tensor network model. +F/+S indicates augmenting our base model with weight fuzzing/subspace sampling. PV is the Paragraph Vector model from [12] and DCNN is the Dynamic Convolutional Neural Network model from [10].

Following the protocol suggested by [19], we measured root-level (i.e. whole-phrase) prediction accuracy on two tasks: fine-grained sentiment prediction and binary sentiment prediction. The fine-grained task involves predicting classes from 1-5, with 1 indicating strongly negative sentiment and 5 indicating strongly positive sentiment. The binary task is similar, but ignores "neutral" phrases (those in class 3) and considers only whether a phrase is generally negative (classes 1/2) or positive (classes 4/5). Table 2 shows the performance of our compact RNTN in four forms that include none, one, or both of subspace sampling and weight fuzzing. Using only l2 regularization on its parameters, our compact RNTN approached the performance of the full RNTN, roughly matching the performance of the second-best method tested in [19]. Adding weight fuzzing improved performance past that of the full RNTN. Adding subspace sampling improved performance further, and adding both noise types pushed our RNTN well past the full RNTN, resulting in state-of-the-art performance on the binary task.

Figure 3: How to feedforward through the Recursive Neural Tensor Network.
First, the tree structure is generated by parsing the input sentence. Then, the vector for each node is computed by look-up at the leaves (i.e. words/tokens) and by a tensor-based transform of the node's children's vectors otherwise.

7 Discussion

We proposed the notion of a pseudo-ensemble, which captures methods such as dropout [9] and feature noising in linear models [5, 21] that have recently drawn significant attention. Using the conceptual framework provided by pseudo-ensembles, we developed and applied a regularizer that performs well empirically and provides insight into the mechanisms behind dropout's success. We also showed how pseudo-ensembles can be used to improve the performance of an already powerful model on a competitive real-world sentiment analysis benchmark. We anticipate that this idea, which unifies several rapidly evolving lines of research, can be used to develop several other novel and successful algorithms, especially for semi-supervised learning.

References

[1] P. Baldi and P. Sadowski. Understanding dropout. In NIPS, 2013.
[2] Y. Bengio, É. Thibodeau-Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. arXiv:1306.1091v5 [cs.LG], 2014.
[3] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and GPU math expression compiler. In Python for Scientific Computing Conference (SciPy), 2010.
[4] D. Bertsimas, D. B. Brown, and C. Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3), 2011.
[5] L. van der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized corrupted features. In ICML, 2013.
[6] I. J. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In ICML, 2012.
[7] Y.
Grandvalet and Y. Bengio. Semi-Supervised Learning, chapter Entropy Regularization. MIT Press, 2006.
[8] T. Hastie, J. Friedman, and R. Tibshirani. The Elements of Statistical Learning, 2nd edition. 2008.
[9] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580v1 [cs.NE], 2012.
[10] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In ACL, 2014.
[11] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[12] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
[13] Q. V. Le, M. A. Ranzato, R. R. Salakhutdinov, A. Y. Ng, and J. Tenenbaum. Workshop on challenges in learning hierarchical models: Transfer learning and optimization. In NIPS, 2011.
[14] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML, 2013.
[15] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[16] S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In NIPS, 2011.
[17] O. Rippel, M. A. Gelbart, and R. P. Adams. Learning ordered representations with nested dropout. In ICML, 2014.
[18] A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory. Society for Industrial and Applied Mathematics (SIAM), 2009.
[19] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
[20] P. Vincent, H. Larochelle, and Y. Bengio. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[21] S. Wager, S.
Wang, and P. Liang. Dropout training as adaptive regularization. In NIPS, 2013.
[22] D. Warde-Farley, I. J. Goodfellow, A. Courville, and Y. Bengio. An empirical analysis of dropout in piecewise linear networks. In ICLR, 2014.
[23] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In ICML, 2008.
[24] H. Xu, C. Caramanis, and S. Mannor. Robust regression and lasso. In NIPS, 2009.
[25] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. JMLR, 10, 2009.