{"title": "MixMatch: A Holistic Approach to Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5049, "page_last": 5059, "abstract": "Semi-supervised learning has proven to be a powerful paradigm for leveraging unlabeled data to mitigate the reliance on large labeled datasets.\nIn this work, we unify the current dominant approaches for semi-supervised learning to produce a new algorithm, MixMatch, that\nguesses low-entropy labels for data-augmented unlabeled examples and mixes labeled and unlabeled data using MixUp.\nMixMatch obtains state-of-the-art results by a large margin across many datasets and labeled data amounts. For example,\non CIFAR-10 with 250 labels, we reduce error rate by a factor of 4 (from 38% to 11%) and by a factor of 2 on STL-10.\nWe also demonstrate how MixMatch can help achieve a dramatically better accuracy-privacy trade-off for differential privacy.\nFinally, we perform an ablation study to tease apart which components of MixMatch are most important for its success.\nCode is attached.", "full_text": "MixMatch: A Holistic Approach to\n\nSemi-Supervised Learning\n\nDavid Berthelot\nGoogle Research\n\nNicholas Carlini\nGoogle Research\n\nIan Goodfellow\n\nWork done at Google\n\ndberth@google.com\n\nncarlini@google.com\n\nian-academic@mailfence.com\n\nAvital Oliver\n\nGoogle Research\n\nNicolas Papernot\nGoogle Research\n\nColin Raffel\n\nGoogle Research\n\navitalo@google.com\n\npapernot@google.com\n\ncraffel@google.com\n\nAbstract\n\nSemi-supervised learning has proven to be a powerful paradigm for leveraging\nunlabeled data to mitigate the reliance on large labeled datasets. In this work, we\nunify the current dominant approaches for semi-supervised learning to produce a\nnew algorithm, MixMatch, that guesses low-entropy labels for data-augmented un-\nlabeled examples and mixes labeled and unlabeled data using MixUp. 
MixMatch\nobtains state-of-the-art results by a large margin across many datasets and labeled\ndata amounts. For example, on CIFAR-10 with 250 labels, we reduce error rate by a\nfactor of 4 (from 38% to 11%) and by a factor of 2 on STL-10. We also demonstrate\nhow MixMatch can help achieve a dramatically better accuracy-privacy trade-off\nfor differential privacy. Finally, we perform an ablation study to tease apart which\ncomponents of MixMatch are most important for its success. We release all code\nused in our experiments.1\n\n1\n\nIntroduction\n\nMuch of the recent success in training large, deep neural networks is thanks in part to the existence\nof large labeled datasets. Yet, collecting labeled data is expensive for many learning tasks because\nit necessarily involves expert knowledge. This is perhaps best illustrated by medical tasks where\nmeasurements call for expensive machinery and labels are the fruit of a time-consuming analysis that\ndraws from multiple human experts. Furthermore, data labels may contain private information. In\ncomparison, in many tasks it is much easier or cheaper to obtain unlabeled data.\nSemi-supervised learning [6] (SSL) seeks to largely alleviate the need for labeled data by allowing\na model to leverage unlabeled data. Many recent approaches for semi-supervised learning add a\nloss term which is computed on unlabeled data and encourages the model to generalize better to\nunseen data. 
In much recent work, this loss term falls into one of three classes (discussed further\nin Section 2): entropy minimization [18, 28]\u2014which encourages the model to output con\ufb01dent\npredictions on unlabeled data; consistency regularization\u2014which encourages the model to produce\nthe same output distribution when its inputs are perturbed; and generic regularization\u2014which\nencourages the model to generalize well and avoid over\ufb01tting the training data.\nIn this paper, we introduce MixMatch, an SSL algorithm which introduces a single loss that gracefully\nuni\ufb01es these dominant approaches to semi-supervised learning. Unlike previous methods, MixMatch\ntargets all the properties at once which we \ufb01nd leads to the following bene\ufb01ts:\n\n1https://github.com/google-research/mixmatch\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Diagram of the label guessing process used in MixMatch. Stochastic data augmentation\nis applied to an unlabeled image K times, and each augmented image is fed through the classi\ufb01er.\nThen, the average of these K predictions is \u201csharpened\u201d by adjusting the distribution\u2019s temperature.\nSee algorithm 1 for a full description.\n\n\u2022 Experimentally, we show that MixMatch obtains state-of-the-art results on all standard\nimage benchmarks (section 4.2), and reducing the error rate on CIFAR-10 by a factor of 4;\n\n\u2022 We further show in an ablation study that MixMatch is greater than the sum of its parts;\n\u2022 We demonstrate in section 4.3 that MixMatch is useful for differentially private learning,\nenabling students in the PATE framework [36] to obtain new state-of-the-art results that\nsimultaneously strengthen both privacy guarantees and accuracy.\n\nIn short, MixMatch introduces a uni\ufb01ed loss term for unlabeled data that seamlessly reduces entropy\nwhile maintaining consistency and remaining compatible with traditional 
regularization techniques.

2 Related Work

To set the stage for MixMatch, we first introduce existing methods for SSL. We focus mainly on those which are currently state-of-the-art and that MixMatch builds on; there is a wide literature on SSL techniques that we do not discuss here (e.g., "transductive" models [14, 22, 21], graph-based methods [49, 4, 29], generative modeling [3, 27, 41, 9, 17, 23, 38, 34, 42], etc.). More comprehensive overviews are provided in [49, 6]. In the following, we will refer to a generic model pmodel(y | x; θ) which produces a distribution over class labels y for an input x with parameters θ.

2.1 Consistency Regularization

A common regularization technique in supervised learning is data augmentation, which applies input transformations assumed to leave class semantics unaffected. For example, in image classification, it is common to elastically deform or add noise to an input image, which can dramatically change the pixel content of an image without altering its label [7, 43, 10]. Roughly speaking, this can artificially expand the size of a training set by generating a near-infinite stream of new, modified data. Consistency regularization applies data augmentation to semi-supervised learning by leveraging the idea that a classifier should output the same class distribution for an unlabeled example even after it has been augmented. More formally, consistency regularization enforces that an unlabeled example x should be classified the same as Augment(x), an augmentation of itself.

In the simplest case, for unlabeled points x, prior work [25, 40] adds the loss term

||pmodel(y | Augment(x); θ) − pmodel(y | Augment(x); θ)||₂²    (1)

Note that Augment(x) is a stochastic transformation, so the two terms in eq. (1) are not identical. "Mean Teacher" [44] replaces one of the terms in eq. 
(1) with the output of the model using an exponential moving average of model parameter values. This provides a more stable target and was found empirically to significantly improve results. A drawback to these approaches is that they use domain-specific data augmentation strategies. "Virtual Adversarial Training" [31] (VAT) addresses this by instead computing an additive perturbation to apply to the input which maximally changes the output class distribution. MixMatch utilizes a form of consistency regularization through the use of standard data augmentation for images (random horizontal flips and crops).

2.2 Entropy Minimization

A common underlying assumption in many semi-supervised learning methods is that the classifier's decision boundary should not pass through high-density regions of the marginal data distribution. One way to enforce this is to require that the classifier output low-entropy predictions on unlabeled data. This is done explicitly in [18] with a loss term which minimizes the entropy of pmodel(y | x; θ) for unlabeled data x. This form of entropy minimization was combined with VAT in [31] to obtain stronger results. "Pseudo-Label" [28] does entropy minimization implicitly by constructing hard (1-hot) labels from high-confidence predictions on unlabeled data and using these as training targets in a standard cross-entropy loss. MixMatch also implicitly achieves entropy minimization through the use of a "sharpening" function on the target distribution for unlabeled data, described in section 3.2.

2.3 Traditional Regularization

Regularization refers to the general approach of imposing a constraint on a model to make it harder to memorize the training data and therefore hopefully make it generalize better to unseen data [19]. 
We use weight decay, which penalizes the L2 norm of the model parameters [30, 46]. We also use MixUp [47] in MixMatch to encourage convex behavior "between" examples. We utilize MixUp both as a regularizer (applied to labeled datapoints) and as a semi-supervised learning method (applied to unlabeled datapoints). MixUp has been previously applied to semi-supervised learning; in particular, the concurrent work of [45] uses a subset of the methodology used in MixMatch. We clarify the differences in our ablation study (section 4.2.3).

3 MixMatch

In this section, we introduce MixMatch, our proposed semi-supervised learning method. MixMatch is a "holistic" approach which incorporates ideas and components from the dominant paradigms for SSL discussed in section 2. Given a batch X of labeled examples with one-hot targets (representing one of L possible labels) and an equally-sized batch U of unlabeled examples, MixMatch produces a processed batch of augmented labeled examples X′ and a batch of augmented unlabeled examples with "guessed" labels U′. U′ and X′ are then used in computing separate labeled and unlabeled loss terms. More formally, the processed batches and the two loss terms are defined as

X′, U′ = MixMatch(X, U, T, K, α)    (2)

LX = (1/|X′|) Σ_{(x,p)∈X′} H(p, pmodel(y | x; θ))    (3)

LU = (1/(L|U′|)) Σ_{(u,q)∈U′} ||q − pmodel(y | u; θ)||₂²    (4)

where H(p, q) is the cross-entropy between distributions p and q, and T, K, α, and λU are hyperparameters described below. The full MixMatch algorithm is provided in algorithm 1, and a diagram of the label guessing process is shown in fig. 1. 
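The labeled and unlabeled loss terms and their weighted combination (eqs. (3)-(5)) can be sketched in NumPy as follows. This is a minimal illustration only: the function and argument names are ours, and the `preds_*` arrays stand in for the model outputs pmodel(y | ·; θ).

```python
import numpy as np

def cross_entropy(p, q, eps=1e-8):
    """H(p, q): cross-entropy between a target distribution p and a prediction q."""
    return -np.sum(p * np.log(q + eps), axis=-1)

def mixmatch_loss(preds_x, targets_x, preds_u, guesses_u, lambda_u, num_classes):
    """Combined semi-supervised loss of eqs. (3)-(5).

    preds_x / preds_u: model output distributions on the processed batches X' / U'.
    targets_x: (mixed) labels from X'; guesses_u: (mixed) guessed labels from U'.
    """
    # L_X (eq. (3)): mean cross-entropy over the labeled batch
    loss_x = np.mean(cross_entropy(targets_x, preds_x))
    # L_U (eq. (4)): squared L2 distance between guessed labels and predictions,
    # averaged over the batch and divided by the number of classes L
    loss_u = np.mean(np.sum((guesses_u - preds_u) ** 2, axis=-1)) / num_classes
    # eq. (5): weighted combination of the two terms
    return loss_x + lambda_u * loss_u
```

Note that the division by `num_classes` reflects the 1/L factor in the unlabeled term, and `lambda_u` is the unsupervised loss weight λU.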
The combined loss L for semi-supervised learning is then

L = LX + λU LU    (5)

Next, we describe each part of MixMatch.

3.1 Data Augmentation

As is typical in many SSL methods, we use data augmentation both on labeled and unlabeled data. For each xb in the batch of labeled data X, we generate a transformed version x̂b = Augment(xb) (algorithm 1, line 3). For each ub in the batch of unlabeled data U, we generate K augmentations ûb,k = Augment(ub), k ∈ (1, . . . , K) (algorithm 1, line 5). We use these individual augmentations to generate a "guessed label" qb for each ub, through a process we describe in the following subsection.

3.2 Label Guessing

For each unlabeled example in U, MixMatch produces a "guess" for the example's label using the model's predictions. This guess is later used in the unsupervised loss term. To do so, we compute the average of the model's predicted class distributions across all K augmentations of ub by

q̄b = (1/K) Σ_{k=1}^{K} pmodel(y | ûb,k; θ)    (6)

in algorithm 1, line 7. Using data augmentation to obtain an artificial target for an unlabeled example is common in consistency regularization methods [25, 40, 44].

Algorithm 1 MixMatch takes a batch of labeled data X and a batch of unlabeled data U and produces a collection X′ (resp. U′) of processed labeled examples (resp. unlabeled with guessed labels).

1: Input: Batch of labeled examples and their one-hot labels X = ((xb, pb); b ∈ (1, . . . , B)), batch of unlabeled examples U = (ub; b ∈ (1, . . . 
, B)), sharpening temperature T, number of augmentations K, Beta distribution parameter α for MixUp.
2: for b = 1 to B do
3:   x̂b = Augment(xb)  // Apply data augmentation to xb
4:   for k = 1 to K do
5:     ûb,k = Augment(ub)  // Apply kth round of data augmentation to ub
6:   end for
7:   q̄b = (1/K) Σk pmodel(y | ûb,k; θ)  // Compute average predictions across all augmentations of ub
8:   qb = Sharpen(q̄b, T)  // Apply temperature sharpening to the average prediction (see eq. (7))
9: end for
10: X̂ = ((x̂b, pb); b ∈ (1, . . . , B))  // Augmented labeled examples and their labels
11: Û = ((ûb,k, qb); b ∈ (1, . . . , B), k ∈ (1, . . . , K))  // Augmented unlabeled examples, guessed labels
12: W = Shuffle(Concat(X̂, Û))  // Combine and shuffle labeled and unlabeled data
13: X′ = (MixUp(X̂i, Wi); i ∈ (1, . . . , |X̂|))  // Apply MixUp to labeled data and entries from W
14: U′ = (MixUp(Ûi, Wi+|X̂|); i ∈ (1, . . . , |Û|))  // Apply MixUp to unlabeled data and the rest of W
15: return X′, U′

Sharpening. In generating a label guess, we perform one additional step inspired by the success of entropy minimization in semi-supervised learning (discussed in section 2.2). Given the average prediction over augmentations q̄b, we apply a sharpening function to reduce the entropy of the label distribution. 
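Lines 7-8 of algorithm 1 (averaging the K predictions and sharpening the average, per eqs. (6) and (7)) can be sketched as follows. This is a minimal NumPy illustration with names of our own choosing; `preds_k` stands for the K model output distributions for one unlabeled example.

```python
import numpy as np

def guess_label(preds_k, T):
    """Guess a label for one unlabeled example from its K augmented predictions:
    average them (algorithm 1, line 7 / eq. (6)) and sharpen the average
    (line 8 / eq. (7))."""
    q_bar = np.mean(preds_k, axis=0)   # average over the K augmentations
    q_sharp = q_bar ** (1.0 / T)       # raise each probability to the power 1/T
    return q_sharp / q_sharp.sum()     # renormalize to a valid distribution
```

With T = 1 the average is returned unchanged; as T → 0 the guess approaches a one-hot distribution.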
In practice, for the sharpening function, we use the common approach of adjusting the "temperature" of this categorical distribution [16], which is defined as the operation

Sharpen(p, T)_i := p_i^(1/T) / Σ_{j=1}^{L} p_j^(1/T)    (7)

where p is some input categorical distribution (specifically in MixMatch, p is the average class prediction over augmentations q̄b, as shown in algorithm 1, line 8) and T is a hyperparameter. As T → 0, the output of Sharpen(p, T) will approach a Dirac ("one-hot") distribution. Since we will later use qb = Sharpen(q̄b, T) as a target for the model's prediction for an augmentation of ub, lowering the temperature encourages the model to produce lower-entropy predictions.

3.3 MixUp

We use MixUp for semi-supervised learning, and unlike past work for SSL we mix both labeled examples and unlabeled examples with label guesses (generated as described in section 3.2). To be compatible with our separate loss terms, we define a slightly modified version of MixUp. For a pair of two examples with their corresponding label probabilities (x1, p1), (x2, p2), we compute (x′, p′) by

λ ∼ Beta(α, α)    (8)
λ′ = max(λ, 1 − λ)    (9)
x′ = λ′x1 + (1 − λ′)x2    (10)
p′ = λ′p1 + (1 − λ′)p2    (11)

where α is a hyperparameter. Vanilla MixUp omits eq. (9) (i.e., it sets λ′ = λ). Given that labeled and unlabeled examples are concatenated in the same batch, we need to preserve the order of the batch to compute individual loss components appropriately. This is achieved by eq. (9), which ensures that x′ is closer to x1 than to x2. To apply MixUp, we first collect all augmented labeled examples with their labels and all unlabeled examples with their guessed labels into

X̂ = ((x̂b, pb); b ∈ (1, . . . 
, B))    (12)
Û = ((ûb,k, qb); b ∈ (1, . . . , B), k ∈ (1, . . . , K))    (13)

(algorithm 1, lines 10–11). Then, we combine these collections and shuffle the result to form W, which will serve as a data source for MixUp (algorithm 1, line 12). For the ith example-label pair in X̂, we compute MixUp(X̂i, Wi) and add the result to the collection X′ (algorithm 1, line 13). We compute U′i = MixUp(Ûi, Wi+|X̂|) for i ∈ (1, . . . , |Û|), intentionally using the remainder of W that was not used in the construction of X′ (algorithm 1, line 14). To summarize, MixMatch transforms X into X′, a collection of labeled examples which have had data augmentation and MixUp (potentially mixed with an unlabeled example) applied. Similarly, U is transformed into U′, a collection of multiple augmentations of each unlabeled example with corresponding label guesses.

3.4 Loss Function

Given our processed batches X′ and U′, we use the standard semi-supervised loss shown in eqs. (3) to (5). Equation (5) combines the typical cross-entropy loss between labels and model predictions from X′ with the squared L2 loss on predictions and guessed labels from U′. We use this L2 loss in eq. (4) (the multiclass Brier score [5]) because, unlike the cross-entropy, it is bounded and less sensitive to incorrect predictions. For this reason, it is often used as the unlabeled data loss in SSL [25, 44] as well as a measure of predictive uncertainty [26]. 
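The modified MixUp of eqs. (8)-(11) (section 3.3), which produces these processed batches, can be sketched as follows. This is a minimal NumPy illustration under our own naming; `rng` is an illustrative random generator.

```python
import numpy as np

def mixup(x1, p1, x2, p2, alpha, rng=None):
    """Modified MixUp of eqs. (8)-(11): taking lambda' = max(lambda, 1 - lambda)
    guarantees the mixed example stays closer to (x1, p1), preserving batch order."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # eq. (8): sample mixing coefficient
    lam = max(lam, 1.0 - lam)             # eq. (9): lambda' >= 0.5
    x_mix = lam * x1 + (1.0 - lam) * x2   # eq. (10): mix inputs
    p_mix = lam * p1 + (1.0 - lam) * p2   # eq. (11): mix label distributions
    return x_mix, p_mix
```

Because λ′ ≥ 0.5, the output can be attributed to the first argument's loss term (labeled or unlabeled), which is why batch order can be preserved.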
We do not propagate gradients through computing the guessed labels, as is standard [25, 44, 31, 35].

3.5 Hyperparameters

Since MixMatch combines multiple mechanisms for leveraging unlabeled data, it introduces various hyperparameters – specifically, the sharpening temperature T, the number of unlabeled augmentations K, the α parameter for the Beta distribution in MixUp, and the unsupervised loss weight λU. In practice, semi-supervised learning methods with many hyperparameters can be problematic because cross-validation is difficult with small validation sets [35, 39]. However, we find in practice that most of MixMatch's hyperparameters can be fixed and do not need to be tuned on a per-experiment or per-dataset basis. Specifically, for all experiments we set T = 0.5 and K = 2. Further, we only change α and λU on a per-dataset basis; we found that α = 0.75 and λU = 100 are good starting points for tuning. In all experiments, we linearly ramp up λU to its maximum value over the first 16,000 steps of training, as is common practice [44].

4 Experiments

We test the effectiveness of MixMatch on standard SSL benchmarks (section 4.2). Our ablation study teases apart the contribution of each of MixMatch's components (section 4.2.3). As an additional application, we consider privacy-preserving learning in section 4.3.

4.1 Implementation details

Unless otherwise noted, in all experiments we use the "Wide ResNet-28" model from [35]. Our implementation of the model and training procedure closely matches that of [35] (including using 5000 examples to select the hyperparameters), except for the following differences: First, instead of decaying the learning rate, we evaluate models using an exponential moving average of their parameters with a decay rate of 0.999. Second, we apply a weight decay of 0.0004 at each update for the Wide ResNet-28 model. 
Finally, we checkpoint every 2^16 training samples and report the median error rate of the last 20 checkpoints. This simplifies the analysis, at a potential cost in accuracy relative to, for example, averaging checkpoints [2] or choosing the checkpoint with the lowest validation error.

4.2 Semi-Supervised Learning

First, we evaluate the effectiveness of MixMatch on four standard benchmark datasets: CIFAR-10 and CIFAR-100 [24], SVHN [32], and STL-10 [8]. Standard practice for evaluating semi-supervised learning on the first three datasets is to treat most of the dataset as unlabeled and use a small portion as labeled data. STL-10 is a dataset specifically designed for SSL, with 5,000 labeled images and 100,000 unlabeled images which are drawn from a slightly different distribution than the labeled data.

Figure 2: Error rate comparison of MixMatch to baseline methods on CIFAR-10 for a varying number of labels. Exact numbers are provided in table 5 (appendix). "Supervised" refers to training with all 50000 training examples and no unlabeled data. With 250 labels, MixMatch reaches an error rate comparable to the next-best method's performance with 4000 labels.

Figure 3: Error rate comparison of MixMatch to baseline methods on SVHN for a varying number of labels. Exact numbers are provided in table 6 (appendix). "Supervised" refers to training with all 73257 training examples and no unlabeled data. With 250 examples, MixMatch nearly reaches the accuracy of supervised training for this model.

4.2.1 Baseline Methods

As baselines, we consider the four methods considered in [35] (Π-Model [25, 40], Mean Teacher [44], Virtual Adversarial Training [31], and Pseudo-Label [28]), which are described in section 2. We also use MixUp [47] on its own as a baseline. 
MixUp is designed as a regularizer for supervised learning, so we modify it for SSL by applying it both to augmented labeled examples and to augmented unlabeled examples with their corresponding predictions. In accordance with standard usage of MixUp, we use a cross-entropy loss between the MixUp-generated guessed label and the model's prediction. As advocated by [35], we reimplemented each of these methods in the same codebase and applied them to the same model (described in section 4.1) to ensure a fair comparison. We re-tuned the hyperparameters for each baseline method, which generally resulted in a marginal accuracy improvement compared to those in [35], thereby providing a more competitive experimental setting for testing MixMatch.

4.2.2 Results

CIFAR-10 For CIFAR-10, we evaluate the accuracy of each method with a varying number of labeled examples from 250 to 4000 (as is standard practice). The results can be seen in fig. 2. We used λU = 75 for CIFAR-10. We created 5 splits for each number of labeled points, each with a different random seed. Each model was trained on each split, and we report the mean and variance of the error rates across splits. We find that MixMatch outperforms all other methods by a significant margin, for example reaching an error rate of 6.24% with 4000 labels. For reference, on the same model, fully supervised training on all 50000 samples achieves an error rate of 4.17%. Furthermore, MixMatch obtains an error rate of 11.08% with only 250 labels. For comparison, at 250 labels the next-best-performing method (VAT [31]) achieves an error rate of 36.03%; taking the fully supervised error rate of 4.17% as the limit attainable with our model, VAT's excess error is over 4.5× MixMatch's ((36.03 − 4.17)/(11.08 − 4.17) ≈ 4.6). In addition, at 4000 labels the next-best-performing method (Mean Teacher [44]) obtains an error rate of 10.36%, which suggests that MixMatch can achieve similar performance with only 1/16 as many labels. 
We believe that the most interesting comparisons are with very few labeled data points, since this reveals the method's sample efficiency, which is central to SSL.

CIFAR-10 and CIFAR-100 with a larger model Some prior work [44, 2] has also considered the use of a larger, 26 million-parameter model. Our base model, as used in [35], has only 1.5 million parameters, which confounds comparison with these results. For a more reasonable comparison, we measure the effect of increasing the width of our base ResNet model and evaluate MixMatch's performance on a 28-layer Wide ResNet model which has 135 filters per layer, resulting in 26 million parameters. We also evaluate MixMatch on this larger model on CIFAR-100 with 10000 labels, to compare to the corresponding result from [2]. The results are shown in table 1. In general, MixMatch matches or outperforms the best results from [2], though we note that the comparison still remains problematic due to the fact that the model from [44, 2] also uses more sophisticated "shake-shake" regularization [15]. For this model, we used a weight decay of 0.0008. We used λU = 75 for CIFAR-10 and λU = 150 for CIFAR-100.

Method               CIFAR-10       CIFAR-100
Mean Teacher [44]     6.28            -
SWA [2]               5.00           28.80
MixMatch              4.95 ± 0.08    25.88 ± 0.30

Table 1: CIFAR-10 and CIFAR-100 error rate (with 4,000 and 10,000 labels respectively) with larger models (26 million parameters).

Method          1000 labels     5000 labels
CutOut [12]         -              12.74
IIC [20]            -              11.20
SWWAE [48]        25.70             -
CC-GAN2 [11]      22.20             -
MixMatch          10.18 ± 1.46      5.59

Table 2: STL-10 error rate using 1000-label splits or the entire 5000-label training set.

Labels        250           500           1000          2000          4000          All
SVHN          3.78 ± 0.26   3.64 ± 0.46   3.27 ± 0.31   3.04 ± 0.13   2.89 ± 0.06   2.59
SVHN+Extra    2.22 ± 0.08   2.17 ± 0.07   2.18 ± 0.06   2.12 ± 0.03   2.07 ± 0.05   1.71

Table 3: Comparison of error rates for SVHN and SVHN+Extra for MixMatch. The last column ("All") contains the fully-supervised performance with all labels in the corresponding training set.

SVHN and SVHN+Extra As with CIFAR-10, we evaluate the performance of each SSL method on SVHN with a varying number of labels from 250 to 4000. As is standard practice, we first consider the setting where the 73257-example training set is split into labeled and unlabeled data. The results are shown in fig. 3. We used λU = 250. Here again the models were evaluated on 5 splits for each number of labeled points, each with a different random seed. We found MixMatch's performance to be relatively constant (and better than all other methods) across all amounts of labeled data. Surprisingly, after additional tuning we were able to obtain extremely good performance from Mean Teacher [44], though its error rate was consistently slightly higher than MixMatch's.

Note that SVHN has two training sets: train and extra. In fully-supervised learning, both sets are concatenated to form the full training set (604388 samples). In SSL, for historical reasons the extra set was left aside and only train was used (73257 samples). We argue that leveraging both train and extra for the unlabeled data is more interesting since it exhibits a higher ratio of unlabeled samples over labeled ones. We report error rates for both SVHN and SVHN+Extra in table 3. For SVHN+Extra we used α = 0.25, λU = 250, and a lower weight decay of 0.000002 due to the larger amount of available data. 
We found that on both training sets, MixMatch nearly matches the fully-supervised\nperformance on the same training set almost immediately \u2013 for example, MixMatch achieves an error\nrate of 2.22% with only 250 labels on SVHN+Extra compared to the fully-supervised performance of\n1.71%. Interestingly, on SVHN+Extra MixMatch outperformed fully supervised training on SVHN\nwithout extra (2.59% error) for every labeled data amount considered. To emphasize the importance\nof this, consider the following scenario: You have 73257 examples from SVHN with 250 examples\nlabeled and are given a choice: You can either obtain 8\u00d7 more unlabeled data and use MixMatch or\nobtain 293\u00d7 more labeled data and use fully-supervised learning. Our results suggest that obtaining\nadditional unlabeled data and using MixMatch is more effective, which conveniently is likely much\ncheaper than obtaining 293\u00d7 more labels.\n\nSTL-10 STL-10 contains 5000 training examples aimed at being used with 10 prede\ufb01ned folds (we\nuse the \ufb01rst 5 only) with 1000 examples each. However, some prior work trains on all 5000 examples.\nWe thus compare in both experimental settings. With 1000 examples MixMatch surpasses both the\nstate-of-the-art for 1000 examples as well as the state-of-the-art using all 5000 labeled examples.\nNote that none of the baselines in table 2 use the same experimental setup (i.e. model), so it is dif\ufb01cult\nto directly compare the results; however, because MixMatch obtains the lowest error by a factor of\ntwo, we take this to be a vote in con\ufb01dence of our method. We used \u03bbU = 50.\n\n4.2.3 Ablation Study\n\nSince MixMatch combines various semi-supervised learning mechanisms, it has a good deal in\ncommon with existing methods in the literature. 
As a result, we study the effect of removing or adding components in order to provide additional insight into what makes MixMatch performant. Specifically, we measure the effect of

• using the mean class distribution over K augmentations or using the class distribution for a single augmentation (i.e. setting K = 1)
• removing temperature sharpening (i.e. setting T = 1)
• using an exponential moving average (EMA) of model parameters when producing guessed labels, as is done by Mean Teacher [44]
• performing MixUp between labeled examples only, unlabeled examples only, and without mixing across labeled and unlabeled examples
• using Interpolation Consistency Training [45], which can be seen as a special case of this ablation study where only unlabeled mixup is used, no sharpening is applied, and EMA parameters are used for label guessing.

Ablation                                                 250 labels   4000 labels
MixMatch                                                   11.80         6.00
MixMatch without distribution averaging (K = 1)            17.09         8.06
MixMatch with K = 3                                        11.55         6.23
MixMatch with K = 4                                        12.45         5.88
MixMatch without temperature sharpening (T = 1)            27.83        10.59
MixMatch with parameter EMA                                11.86         6.47
MixMatch without MixUp                                     39.11        10.97
MixMatch with MixUp on labeled only                        32.16         9.22
MixMatch with MixUp on unlabeled only                      12.35         6.83
MixMatch with MixUp on separate labeled and unlabeled      12.26         6.50
Interpolation Consistency Training [45]                    38.60         6.81

Table 4: Ablation study results. All values are error rates on CIFAR-10 with 250 or 4000 labels.

We carried out the ablation on CIFAR-10 with 250 and 4000 labels; the results are shown in table 4. We find that each component contributes to MixMatch's performance, with the most dramatic differences in the 250-label setting. Despite Mean Teacher's effectiveness on SVHN (fig. 
3), we\nfound that using a similar EMA of parameter values hurt MixMatch\u2019s performance slightly.\n\n4.3 Privacy-Preserving Learning and Generalization\n\nLearning with privacy allows us to measure our approach\u2019s ability to generalize. Indeed, protecting\nthe privacy of training data amounts to proving that the model does not over\ufb01t: a learning algorithm\nis said to be differentially private (the most widely accepted technical de\ufb01nition of privacy) if adding,\nmodifying, or removing any of its training samples is guaranteed not to result in a statistically\nsigni\ufb01cant difference in the model parameters learned [13]. For this reason, learning with differential\nprivacy is, in practice, a form of regularization [33]. Each training data access constitutes a potential\nprivacy leakage, encoded as the pair of the input and its label. Hence, approaches for deep learning\nfrom private training data, such as DP-SGD [1] and PATE [36], bene\ufb01t from accessing as few labeled\nprivate training points as possible when computing updates to the model parameters. Semi-supervised\nlearning is a natural \ufb01t for this setting.\nWe use the PATE framework for learning with privacy. A student is trained in a semi-supervised way\nfrom public unlabeled data, part of which is labeled by an ensemble of teachers with access to private\nlabeled training data. The fewer labels a student requires to reach a \ufb01xed accuracy, the stronger is the\nprivacy guarantee it provides. Teachers use a noisy voting mechanism to respond to label queries\nfrom the student, and they may choose not to provide a label when they cannot reach a suf\ufb01ciently\nstrong consensus. 
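As a purely illustrative sketch of such a noisy, abstaining teacher vote (all names, the use of Laplace noise, and the threshold rule here are our assumptions; PATE's actual aggregation mechanisms and privacy accounting are more involved):

```python
import numpy as np

def noisy_vote(teacher_labels, num_classes, noise_scale, threshold, rng=None):
    """Aggregate teacher votes with Laplace noise, abstaining (returning None)
    when the noisy consensus is too weak. Illustrative only; not the exact
    PATE aggregation mechanism."""
    if rng is None:
        rng = np.random.default_rng()
    # Count how many teachers voted for each class
    counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
    # Add noise so that any single teacher's vote is obscured
    noisy = counts + rng.laplace(scale=noise_scale, size=num_classes)
    if noisy.max() < threshold:
        return None  # consensus too weak: refuse to answer the student's query
    return int(np.argmax(noisy))
```

The abstention branch corresponds to the teachers declining to label a query, and fewer answered queries translate into a stronger privacy guarantee for the student.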
For this reason, if MixMatch improves the performance of PATE, it would also illustrate MixMatch's improved generalization from few canonical exemplars of each class.

We compare the accuracy-privacy trade-off achieved by MixMatch to a VAT [31] baseline on SVHN. VAT achieved the previous state-of-the-art of 91.6% test accuracy for a privacy loss of ε = 4.96 [37]. Because MixMatch performs well with few labeled points, it is able to achieve 95.21 ± 0.17% test accuracy for a much smaller privacy loss of ε = 0.97. Because e^ε is used to measure the degree of privacy, the improvement is approximately e^(4.96−0.97) ≈ e^4 ≈ 55×, a significant improvement. A privacy loss ε below 1 corresponds to a much stronger privacy guarantee. Note that in the private training setting the student model only uses 10,000 total examples.

5 Conclusion

We introduced MixMatch, a semi-supervised learning method which combines ideas and components from the current dominant paradigms for SSL. Through extensive experiments on semi-supervised and privacy-preserving learning, we found that MixMatch exhibited significantly improved performance compared to other methods in all settings we studied, often reducing the error rate by a factor of two or more. In future work, we are interested in incorporating additional ideas from the semi-supervised learning literature into hybrid methods and continuing to explore which components result in effective algorithms. Separately, most modern work on semi-supervised learning algorithms is evaluated on image benchmarks; we are interested in exploring the effectiveness of MixMatch in other domains.

Acknowledgement

We would like to thank Balaji Lakshminarayanan for his helpful theoretical insights.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. 
In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[2] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. Improving consistency-based semi-supervised learning with weight averaging. arXiv preprint arXiv:1806.05594, 2018.

[3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2002.

[4] Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. Label Propagation and Quadratic Criterion, chapter 11. MIT Press, 2006.

[5] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[6] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.

[7] Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.

[8] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

[9] Adam Coates and Andrew Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, 2011.

[10] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[11] Emily Denton, Sam Gross, and Rob Fergus. Semi-supervised learning with context-conditional generative adversarial networks. arXiv preprint arXiv:1611.06430, 2016.

[12] Terrance DeVries and Graham W. Taylor. 
Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. Journal of Privacy and Confidentiality, 7(3):17–51, 2016.

[14] Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik. Learning by transduction. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.

[15] Xavier Gastaldi. Shake-shake regularization. Fifth International Conference on Learning Representations (Workshop Track), 2017.

[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[17] Ian J. Goodfellow, Aaron Courville, and Yoshua Bengio. Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models, 2011.

[18] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, 2005.

[19] Geoffrey Hinton and Drew van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory, 1993.

[20] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.

[21] Thorsten Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, 1999.

[22] Thorsten Joachims. Transductive learning via spectral graph partitioning. In International Conference on Machine Learning, 2003.

[23] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. 
In Advances in Neural Information Processing Systems, 2014.

[24] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[25] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In Fifth International Conference on Learning Representations, 2017.

[26] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

[27] Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.

[28] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.

[29] Bin Liu, Zhirong Wu, Han Hu, and Stephen Lin. Deep metric transfer for label propagation with limited annotated data. arXiv preprint arXiv:1812.08781, 2018.

[30] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.

[31] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[32] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[33] Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.

[34] Augustus Odena. Semi-supervised learning with generative adversarial networks. 
arXiv preprint arXiv:1606.01583, 2016.

[35] Avital Oliver, Augustus Odena, Colin Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.

[36] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.

[37] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.

[38] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, 2016.

[39] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 2015.

[40] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, 2016.

[41] Ruslan Salakhutdinov and Geoffrey E. Hinton. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, 2007.

[42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.

[43] Patrice Y. Simard, David Steinkraus, and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. 
In Proceedings of the International Conference on Document Analysis and Recognition, 2003.

[44] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 2017.

[45] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825, 2019.

[46] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018.

[47] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[48] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.

[49] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning, 2003.