{"title": "Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 10456, "page_last": 10465, "abstract": "The growing importance of massive datasets with the advent of deep learning makes robustness to label noise a critical property for classifiers to have. Sources of label noise include automatic labeling for large datasets, non-expert labeling, and label corruption by data poisoning adversaries. In the latter case, corruptions may be arbitrarily bad, even so bad that a classifier predicts the wrong labels with high confidence. To protect against such sources of noise, we leverage the fact that a small set of clean labels is often easy to procure. We demonstrate that robustness to label noise up to severe strengths can be achieved by using a set of trusted data with clean labels, and propose a loss correction that utilizes trusted examples in a data-efficient manner to mitigate the effects of label noise on deep neural network classifiers. Across vision and natural language processing tasks, we experiment with various label noises at several strengths, and show that our method significantly outperforms existing methods.", "full_text": "Using Trusted Data to Train Deep Networks on\n\nLabels Corrupted by Severe Noise\n\nDan Hendrycks\u2217\n\nUniversity of California, Berkeley\nhendrycks@berkeley.edu\n\nMantas Mazeika\u2217\nUniversity of Chicago\nmantas@ttic.edu\n\nDuncan Wilson\n\nFoundational Research Institute\nduncanw@nevada.unr.edu\n\nKevin Gimpel\n\nToyota Technological Institute at Chicago\n\nkgimpel@ttic.edu\n\nAbstract\n\nThe growing importance of massive datasets used for deep learning makes robust-\nness to label noise a critical property for classi\ufb01ers to have. Sources of label noise\ninclude automatic labeling, non-expert labeling, and label corruption by data poi-\nsoning adversaries. 
Numerous previous works assume that no source of labels can be trusted. We relax this assumption and assume that a small subset of the training data is trusted. This enables substantial gains in robustness to label corruption. In addition, particularly severe label noise can be combated by using a set of trusted data with clean labels. We utilize trusted data by proposing a loss correction technique that uses trusted examples in a data-efficient manner to mitigate the effects of label noise on deep neural network classifiers. Across vision and natural language processing tasks, we experiment with various label noises at several strengths, and show that our method significantly outperforms existing methods.

1 Introduction

Robustness to label noise is set to become an increasingly important property of supervised learning models. With the advent of deep learning, the need for more labeled data makes it inevitable that not all examples will have high-quality labels. This is especially true of data sources that admit automatic label extraction, such as web crawling for images, and of tasks for which high-quality labels are expensive to produce, such as semantic segmentation or parsing. Additionally, label corruption may arise in data poisoning [10, 24]. Both natural and malicious label corruptions tend to sharply degrade the performance of classification systems [30].
Most prior work on label corruption robustness assumes that all training data are potentially corrupted. However, it is usually the case that a number of trusted examples are available. Trusted data are already gathered to create validation and test sets; when it is possible to curate trusted data, a small trusted set could likewise be created for training. We depart from the assumption that all training data are potentially corrupted by assuming that a subset of the training data is trusted.
In turn we demonstrate\nthat having some amount of trusted training data enables signi\ufb01cant robustness gains.\nTo leverage the additional information from trusted labels, we propose a new loss correction and\nempirically verify it on a number of vision and natural language datasets with label corruption.\nSpeci\ufb01cally, we demonstrate recovery from extremely high levels of label noise, including the dire\ncase when the untrusted data has a majority of its labels corrupted. Such severe corruption can occur\nin adversarial situations like data poisoning, or when the number of classes is large. In comparison to\n\n\u2217Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\floss corrections that do not employ trusted data [18], our method is signi\ufb01cantly more accurate in\nproblem settings with moderate to severe label noise. Relative to a recent method which also uses\ntrusted data [11], our method is far more data-ef\ufb01cient and generally more accurate. These results\ndemonstrate that systems can weather label corruption with access only to a small number of gold\nstandard labels. Experiment code is available at https://github.com/mmazeika/glc.\n\n2 Related Work\n\nThe performance of machine learning systems reliant on labeled data has been shown to degrade\nnoticeably in the presence of label noise [17, 19].\nIn the case of adversarial label noise, this\ndegradation can be even worse [20]. Accordingly, modeling, correcting, and learning with noisy\nlabels has been well studied [16, 1, 3].\nThe methods of [15, 9, 18, 25] allow for label noise robustness by modifying the model\u2019s architecture\nor by implementing a loss correction. Unlike Mnih and Hinton [15] who focus on binary classi\ufb01cation\nof aerial images and Larsen et al. [9] who assume symmetric label noise, [18, 25] consider label noise\nin the multi-class problem setting with asymmetric noise.\nSukhbaatar et al. 
[25] introduce a stochastic matrix measuring label corruption, note that it cannot be calculated without access to the true labels, and propose a method of forward loss correction. Forward loss correction adds a linear layer to the end of the model, and the loss is adjusted accordingly to incorporate learning about the label noise. Patrini et al. [18] also make use of the forward loss correction mechanism, and propose an estimate of the label corruption matrix which relies on strong assumptions and does not make use of clean labels.
Contra [25, 18], we make the assumption that during training the model has access to a small set of clean labels. See Charikar, Steinhardt, and Valiant [2] for a general analysis of this assumption. This assumption has been leveraged by others for the purpose of label noise robustness, most notably [26, 11, 27, 21]. Veit et al. [26] use human-verified labels to train a label cleaning network by estimating the residuals between the noisy and clean labels in a multi-label classification setting. In the multi-class setting that we focus on in this work, Li et al. [11] propose distilling the predictions of a model trained on clean labels into a second network trained on these predictions and the noisy labels. Our work differs from these two in that we do not train neural networks on the clean labels alone.

3 Gold Loss Correction

Figure 1: A label corruption matrix (top left) and three matrix estimates (GLC (Ours), Forward, and Confusion Matrix) for a corrupted CIFAR-10 dataset. Entry Cij is the probability that a label of class i is corrupted to class j, or symbolically Cij = p(ỹ = j | y = i).

We are given an untrusted dataset D̃ of u examples (x, ỹ), and we assume that these examples are potentially corrupted examples from the true data distribution p(x, y) with K classes. Corruption is specified by a label noise distribution p(ỹ | y, x). We are also given a trusted dataset D of t examples drawn from p(x, y), where t/u ≪ 1. We refer to t/u as the trusted fraction. Concretely, a web scraper labeling images from metadata may produce an untrusted set, while expert-annotated examples would form a trusted dataset and be a gold standard.
Our method makes use of D to estimate the K × K matrix of corruption probabilities Cij = p(ỹ = j | y = i). Once this estimate is obtained, we use it to train a modified classifier from which we recover an estimate of the desired conditional distribution p(y | x). We call this method the Gold Loss Correction (GLC), so named because we make use of trusted or gold standard labels.
We explore two avenues of utilizing D to improve this approach. The first directly uses the trusted data while training the final classifier. As this could be applied to existing methods, we run ablations to demonstrate its effect. The second avenue uses the additional information conferred by the clean labels to better model the label noise for use in a corrected classifier.
Estimating The Corruption Matrix. First, we train a classifier p̂(ỹ | x) on D̃. Let ỹ and y be in the set of possible labels. To estimate the probability p(ỹ | y), we use the identity p(ỹ | y, x) p(x | y) = p(ỹ | y) p(x | ỹ, y). Integrating over all x gives us

∫ p(ỹ | y, x) p(x | y) dx = p(ỹ | y) ∫ p(x | ỹ, y) dx = p(ỹ | y).

We can approximate the integral on the left with the expectation of p(ỹ | y, x) over the empirical distribution of x given y.
Assuming conditional independence of ỹ and y given x, we have p(ỹ | y, x) = p(ỹ | x), which is directly approximated by p̂(ỹ | x), the classifier trained on D̃. More explicitly, let Ai be the subset of x in D with label i, and denote our estimate of C by Ĉ. We have

Ĉij = (1/|Ai|) Σ_{x ∈ Ai} p̂(ỹ = j | x) = (1/|Ai|) Σ_{x ∈ Ai} p̂(ỹ = j | y = i, x) ≈ p(ỹ = j | y = i).

This is how we estimate our corruption matrix for the GLC. The approximation relies on p̂(ỹ | x) being a good estimate of p(ỹ | x), on the number of trusted examples of each class, and on the extent to which the conditional independence assumption is satisfied. The conditional independence assumption is reasonable, as it is usually the case that noisy labeling processes do not have access to the true label. Moreover, when the data are separable (i.e., y is deterministic given x), the assumption follows. A proof of this is provided in the Supplementary Material. We investigate these factors in experiments.
Training a Corrected Classifier. Now with Ĉ, we follow the method of [25, 18] to train a corrected classifier, which we now briefly describe. Given the K × 1 softmax output s of an untrained classifier, we define the new output as s̃ := Ĉᵀs. We then train p̂(s̃ | x) on the noisy labels with cross-entropy loss. We can further improve on this method by using trusted data to train the corrected classifier: we apply no correction on examples from the trusted set encountered during training. This has the effect of allowing the GLC to handle a degree of instance-dependency in the label noise [14], though our experiments suggest that most of the GLC's performance gains derive from our Ĉ estimate.
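To make these two steps concrete, here is a minimal NumPy sketch of the Ĉ estimate and the forward-corrected loss. This is our own illustrative code with hypothetical variable names, not the authors' released implementation:

```python
import numpy as np

def estimate_corruption_matrix(probs_trusted, labels_trusted, num_classes):
    """Average the untrusted-trained model's softmax outputs p_hat(y_tilde | x)
    over the trusted examples A_i of each true class i to form row i of C_hat."""
    C_hat = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        A_i = probs_trusted[labels_trusted == i]  # softmax outputs for class i
        C_hat[i] = A_i.mean(axis=0)
    return C_hat

def glc_loss(softmax_output, noisy_label, C_hat):
    """Forward-corrected cross-entropy on an untrusted example: the predicted
    clean-label distribution s is mapped to a noisy-label distribution C_hat^T s."""
    corrected = C_hat.T @ softmax_output
    return -np.log(corrected[noisy_label] + 1e-12)
```

On trusted examples, the GLC instead applies the ordinary uncorrected loss to the clean label, matching the two-term training objective in the algorithm that follows.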
A concrete algorithm of our method is provided here.

Algorithm: GOLD LOSS CORRECTION (GLC)
1: Input: trusted data D, untrusted data D̃, loss ℓ
2: Train network f(x) = p̂(ỹ | x; θ) ∈ R^K on D̃
3: Fill Ĉ ∈ R^{K×K} with zeros
4: for k = 1, . . . , K do
5:   num_examples = 0
6:   for (xi, yi) ∈ D such that yi = k do
7:     Ĉk• += f(xi)   {add f(xi) to the kth row}
8:     num_examples += 1
9:   end for
10:  Ĉk• /= num_examples
11: end for
12: Initialize new model g(x) = p̂(y | x; θ)
13: Train with ℓ(g(x), y) on D and ℓ(Ĉᵀg(x), ỹ) on D̃
14: Output: model p̂(y | x; θ)

4 Experiments

Generating Corrupted Labels. Suppose our dataset has t + u examples. We sample a set of t trusted datapoints D, and the remaining u untrusted examples form D̃, which we probabilistically corrupt according to a true corruption matrix C. Note that we do not have knowledge of which of our u untrusted examples are corrupted; we only know that they are potentially corrupted. To generate the untrusted labels from the true labels in D̃, we first obtain a corruption matrix C. Then, for an example with true label i, we sample the corrupted label from the categorical distribution parameterized by the ith row of C. Note that this does not satisfy the conditional independence assumption required for our estimate of C. However, we find that the GLC still works well in practice, perhaps because this assumption is also satisfied when the data are separable, in the sense that each x has a single true y, which is approximately true in many of our experiments.
Comparing Loss Correction Methods. The GLC differs from previous loss corrections for label noise in that it reasonably assumes access to a high-quality annotation source.
Therefore, to compare to other loss correction methods, we ask how each method performs when starting from the same dataset with the same label noise. In other words, the only additional information our method uses is knowledge of which examples are trusted and which are potentially corrupted.

Figure 2: Error curves for numerous label correction methods using a full range of label corruption strengths on several different vision and natural language processing datasets.

4.1 Datasets and Architectures
MNIST. The MNIST dataset contains 28 × 28 grayscale images of the digits 0-9. The training set has 60,000 images and the test set has 10,000 images. For preprocessing, we rescale the pixels to the interval [0, 1]. We train a 2-layer fully connected network with 256 hidden dimensions. We train with Adam for 10 epochs using batches of size 32 and a learning rate of 0.001. For regularization, we use ℓ2 weight decay on all layers with λ = 1 × 10⁻⁶.
CIFAR. The two CIFAR datasets contain 32 × 32 × 3 color images. CIFAR-10 has ten classes, and CIFAR-100 has 100 classes. CIFAR-100 has 20 “superclasses” which partition its 100 classes into 20 semantically similar sets. We use these superclasses for hierarchical noise. Both datasets have 50,000 training images and 10,000 testing images. For both datasets, we train a Wide Residual Network [29] of depth 40 and a widening factor of 2. We train for 75 epochs using SGD with Nesterov momentum and a cosine learning rate schedule [12].
IMDB. The IMDB Large Movie Reviews dataset [13] contains 50,000 highly polarized movie reviews from the Internet Movie Database, split evenly into train and test sets. We pad and clip reviews to a length of 200 tokens, and learn 50-dimensional word vectors from scratch for a vocab size of 5,000. We train an LSTM with 64 hidden dimensions on this data.
We train using the Adam optimizer [8] for 3 epochs with batch size 64 and the suggested learning rate of 0.001. For regularization, we use dropout [23] on the linear output layer with a dropping probability of 0.2.
Twitter. The Twitter Part of Speech dataset [4] contains 1,827 tweets annotated with 25 POS tags. This is split into a training set of 1,000 tweets, a development set of 327 tweets, and a test set of 500 tweets. We use the development set to augment the training set. We use pretrained 50-dimensional word vectors, and for each token, we concatenate word vectors in a fixed window centered on the token. These form our training and test set. We use a window size of 3, and train a 2-layer fully connected network with hidden size 256, and use the GELU nonlinearity [7]. We train with Adam for 15 epochs with batch size 64 and a learning rate of 0.001. For regularization, we use ℓ2 weight decay with λ = 5 × 10⁻⁵ on all but the linear output layer.
SST. The Stanford Sentiment Treebank dataset consists of single-sentence movie reviews [22]. We use the 2-class version (i.e. SST2), which has 6,911 reviews in the training set, 872 in the development set, and 1,821 in the test set. We use the development set to augment the training set. We pad and clip reviews to a length of 30 tokens and learn 100-dimensional word vectors from scratch for a vocab size of 10,000. Our classifier is a word-averaging model with an affine output layer. We use the Adam optimizer for 5 epochs with batch size 50 and learning rate 0.001.
For regularization, we use ℓ2 weight decay with λ = 1 × 10⁻⁴ on the output layer.

[Figure 2 panels: test error vs. corruption strength (0.0–1.0) for CIFAR-10 (Flip, 10% trusted), MNIST (Flip, 1% trusted), SST (Flip, 1% trusted), CIFAR-100 (Flip, 10% trusted), CIFAR-10 (Uniform, 10% trusted), IMDB (Flip, 5% trusted), CIFAR-100 (Hier., 10% trusted), and CIFAR-100 (Uniform, 10% trusted); methods: GLC (Ours), Distillation, Forward Gold, Forward, No Correction.]

Dataset | Corruption | % Trusted | Trusted Only | No Corr. | Forward | Forward Gold | Distill. | Confusion Matrix | GLC (Ours)
MNIST | Uniform | 5 | 37.6 | 12.9 | 14.5 | 13.5 | 42.1 | 21.8 | 10.3
MNIST | Uniform | 10 | 12.9 | 12.3 | 13.9 | 12.3 | 9.2 | 15.1 | 6.3
MNIST | Uniform | 25 | 6.6 | 9.3 | 11.8 | 9.2 | 5.8 | 11.0 | 4.7
MNIST | Flip | 5 | 37.6 | 50.1 | 51.7 | 41.4 | 46.5 | 11.7 | 3.4
MNIST | Flip | 10 | 12.9 | 51.1 | 48.8 | 36.4 | 32.4 | 5.6 | 2.9
MNIST | Flip | 25 | 6.6 | 47.7 | 50.2 | 37.1 | 28.2 | 3.8 | 2.6
MNIST | Mean | - | 19.0 | 30.6 | 31.8 | 25.0 | 27.4 | 11.5 | 5.0
CIFAR-10 | Uniform | 5 | 39.6 | 31.9 | 9.1 | 27.8 | 29.7 | 22.4 | 9.0
CIFAR-10 | Uniform | 10 | 31.3 | 31.9 | 8.6 | 20.6 | 18.3 | 22.7 | 6.9
CIFAR-10 | Uniform | 25 | 17.4 | 32.7 | 7.7 | 27.1 | 11.6 | 16.7 | 6.4
CIFAR-10 | Flip | 5 | 39.6 | 53.3 | 38.6 | 47.8 | 29.7 | 8.1 | 6.6
CIFAR-10 | Flip | 10 | 31.3 | 53.2 | 36.5 | 51.0 | 18.1 | 8.2 | 6.2
CIFAR-10 | Flip | 25 | 17.4 | 52.7 | 37.6 | 49.5 | 11.8 | 7.1 | 6.1
CIFAR-10 | Mean | - | 29.4 | 42.6 | 23.0 | 37.3 | 19.9 | 14.2 | 6.9
CIFAR-100 | Uniform | 5 | 82.4 | 48.8 | 47.7 | 49.6 | 87.5 | 53.6 | 42.4
CIFAR-100 | Uniform | 10 | 67.3 | 48.4 | 47.2 | 48.9 | 61.2 | 49.7 | 33.9
CIFAR-100 | Uniform | 25 | 52.2 | 45.4 | 43.6 | 46.0 | 39.8 | 39.6 | 27.3
CIFAR-100 | Flip | 5 | 82.4 | 62.1 | 61.6 | 62.6 | 87.1 | 28.6 | 27.1
CIFAR-100 | Flip | 10 | 67.3 | 61.9 | 61.0 | 62.2 | 61.8 | 26.9 | 25.8
CIFAR-100 | Flip | 25 | 52.2 | 59.6 | 57.5 | 61.4 | 40.0 | 25.1 | 24.7
CIFAR-100 | Hierarchical | 5 | 82.4 | 50.9 | 51.0 | 52.4 | 87.1 | 45.8 | 34.8
CIFAR-100 | Hierarchical | 10 | 67.3 | 51.9 | 50.5 | 52.1 | 61.7 | 38.8 | 30.2
CIFAR-100 | Hierarchical | 25 | 52.2 | 54.3 | 47.0 | 51.1 | 39.7 | 29.7 | 25.4
CIFAR-100 | Mean | - | 67.3 | 53.7 | 51.9 | 54.0 | 62.9 | 37.5 | 30.2

Table 1: Vision dataset results. Percent trusted is the trusted fraction multiplied by 100. Unless otherwise indicated, all values are percentages representing the area under the error curve computed at 11 test points. The best mean result is bolded.

4.2 Label Noise Corrections

Forward Loss Correction. The forward correction method from Patrini et al. [18] also obtains Ĉ by training a classifier on the noisy labels and using the resulting softmax probabilities. However, this method does not make use of a trusted fraction of the training data. Instead, it uses the argmax at the 97th percentile of softmax probabilities for a given class as a heuristic for detecting an example that is truly a member of said class. As in the original paper, we replace this with the argmax over all softmax probabilities for a given class on CIFAR-100 experiments. The estimate of C is then used to train a corrected classifier in the same way as the GLC.
Forward Gold. To examine the effect of training on trusted labels as done by the GLC, we augment the Forward method by replacing its estimate of C with the identity on trusted examples. We call this method Forward Gold. It can also be seen as the GLC with the Forward method's estimate of C.
Distillation. The distillation method of Li et al. [11] involves training a neural network on a large trusted dataset and using this network to provide soft targets for the untrusted data. In this way, labels are “distilled” from a neural network.
If the classifier's decisions for untrusted inputs are less reliable than the original noisy labels, then the network's utility is limited. Thus, to obtain a reliable neural network, a large trusted dataset is necessary. A new classifier is trained using labels that are a convex combination of the soft targets and the original untrusted labels.
Confusion Matrices. An intuitive alternative to the GLC is to estimate C by a confusion matrix. To do this, we train a classifier on the untrusted examples, obtain its confusion matrix on the trusted examples, row-normalize the matrix, and then train a corrected classifier as in the GLC.

Dataset | Corruption | % Trusted | Trusted Only | No Corr. | Forward | Forward Gold | Distill. | Confusion Matrix | GLC (Ours)
SST | Uniform | 5 | 45.4 | 27.5 | 26.5 | 26.6 | 43.4 | 26.1 | 24.2
SST | Uniform | 10 | 35.2 | 27.2 | 26.2 | 25.9 | 33.3 | 25.0 | 23.5
SST | Uniform | 25 | 26.1 | 26.5 | 25.3 | 24.6 | 25.0 | 22.4 | 21.7
SST | Flip | 5 | 45.4 | 50.2 | 50.3 | 50.3 | 48.8 | 26.0 | 24.9
SST | Flip | 10 | 35.2 | 49.9 | 50.1 | 49.9 | 42.1 | 24.6 | 23.5
SST | Flip | 25 | 26.1 | 48.7 | 49.0 | 47.3 | 31.8 | 22.4 | 21.7
SST | Mean | - | 35.6 | 38.3 | 37.9 | 37.4 | 37.4 | 24.4 | 23.3
IMDB | Uniform | 5 | 36.9 | 26.7 | 27.9 | 27.6 | 35.5 | 25.4 | 25.0
IMDB | Uniform | 10 | 26.2 | 25.8 | 27.2 | 26.1 | 24.9 | 23.3 | 22.3
IMDB | Uniform | 25 | 22.2 | 21.4 | 23.0 | 20.1 | 21.0 | 18.9 | 18.7
IMDB | Flip | 5 | 36.9 | 49.2 | 49.2 | 49.2 | 41.8 | 25.8 | 25.2
IMDB | Flip | 10 | 26.2 | 47.8 | 48.3 | 47.5 | 28.0 | 22.1 | 22.0
IMDB | Flip | 25 | 22.2 | 39.4 | 39.6 | 36.6 | 23.5 | 19.2 | 18.5
IMDB | Mean | - | 28.5 | 35.0 | 35.9 | 34.5 | 29.1 | 22.5 | 22.0
Twitter | Uniform | 5 | 35.9 | 37.1 | 51.7 | 44.1 | 32.0 | 41.5 | 31.0
Twitter | Uniform | 10 | 23.6 | 33.5 | 49.5 | 40.2 | 22.2 | 33.6 | 22.3
Twitter | Uniform | 25 | 16.3 | 25.5 | 40.6 | 26.4 | 16.6 | 20.0 | 15.5
Twitter | Flip | 5 | 35.9 | 56.2 | 61.6 | 54.8 | 36.4 | 23.4 | 15.8
Twitter | Flip | 10 | 23.6 | 53.8 | 59.0 | 48.9 | 26.1 | 15.9 | 12.9
Twitter | Flip | 25 | 16.3 | 43.0 | 52.5 | 36.7 | 20.5 | 13.3 | 12.8
Twitter | Mean | - | 25.3 | 41.5 | 52.5 | 41.9 | 25.7 | 24.6 | 18.4

Table 2: NLP dataset results. Percent trusted is the trusted fraction multiplied by 100. Unless otherwise indicated, all values are percentages representing the area under the error curve computed at 11 test points. The best mean result is bolded.

4.3 Uniform, Flip, and Hierarchical Corruption

Corruption-Generating Matrices. We consider three types of corruption matrices: corrupting uniformly to all classes, i.e. Cij = 1/K; flipping a label to a different class; and corrupting uniformly to classes which are semantically similar. To create a uniform corruption at different strengths, we take a convex combination of an identity matrix and the matrix 11ᵀ/K. We refer to the coefficient of 11ᵀ/K as the corruption strength for a “uniform” corruption. A “flip” corruption at strength m involves, for each row, giving one off-diagonal column probability mass m and the entry along the diagonal probability mass 1 − m. Finally, a more realistic corruption is hierarchical corruption.
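The uniform and flip constructions, together with the row-wise sampling of corrupted labels described in Section 4, can be written down directly. The following NumPy sketch is our own illustration, not the authors' code:

```python
import numpy as np

def uniform_corruption(num_classes, strength):
    # Convex combination of the identity and the all-ones matrix 11^T / K;
    # `strength` is the coefficient of 11^T / K.
    K = num_classes
    return (1.0 - strength) * np.eye(K) + strength * np.ones((K, K)) / K

def flip_corruption(num_classes, strength, rng):
    # Each row keeps probability mass 1 - m on the diagonal and places
    # mass m on one randomly chosen off-diagonal class.
    K = num_classes
    C = (1.0 - strength) * np.eye(K)
    for i in range(K):
        j = (i + rng.integers(1, K)) % K  # any class other than i
        C[i, j] += strength
    return C

def corrupt_labels(true_labels, C, rng):
    # Sample each noisy label from the categorical distribution given by
    # row y of the corruption matrix C.
    return np.array([rng.choice(len(C), p=C[y]) for y in true_labels])
```

At strength 1, the uniform matrix assigns every class equal mass 1/K, while the flip matrix moves all probability mass off the diagonal.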
For this corruption, we apply uniform corruption only to semantically similar classes; for example, “bed” may be corrupted to “couch” but not “beaver” in CIFAR-100. For CIFAR-100, examples are deemed semantically similar if they share the same “superclass” label specified by the dataset creators.
Experiments and Analysis of Results. We train the models described in Section 4.1 under uniform, label-flipping, and hierarchical label corruptions at various fractions of trusted data. To assess the performance of the GLC, we compare it to other loss correction methods and to two baselines: one where we train a network only on trusted data without any label corrections, and one where the network trains on all data without any label corrections. We record errors on the test sets at the corruption strengths {0, 0.1, . . . , 1.0}. Since we compute the model's accuracy at numerous corruption strengths, the CIFAR experiments involve training over 500 Wide Residual Networks. In Tables 1 and 2, we report the area under the error curves across corruption strengths {0, 0.1, . . . , 1.0} for all baselines and corrections. A sample of error curves is displayed in Figure 2. These curves are the linear interpolation of the errors at the eleven corruption strengths.
Across all experiments, the GLC obtains better area under the error curve than the baselines and the Forward and Distillation methods. The rankings of the other methods and baselines are mixed. On MNIST, training on the trusted data alone outperforms all methods save for the GLC and Confusion Matrix, but performs significantly worse on CIFAR-100, even with large trusted fractions.
The Confusion Matrix correction performs second to the GLC, which indicates that normalized confusion matrices are not as accurate as our method of estimating C.
We verified this by inspecting the estimates directly, and found that normalized confusion matrices give a highly biased estimate due to taking an argmax over class scores rather than using random sampling. Figure 1 shows an example of this bias in the case of label flipping corruption at a strength of 7/10.
Interestingly, Forward Gold performs worse than Forward on several datasets. We did not observe the same behavior when turning off the corresponding component of the GLC, and believe it may be due to variance introduced during training by the difference in signal provided by the Forward method's C estimate and the clean labels. The GLC provides a superior C estimate, and thus may be better able to leverage training on the clean labels. Additional results on SVHN are in the Supplementary Material.
We also compare the GLC to the recent work of Ren et al. [21], which proposes a loss correction that uses a trusted set and meta-learning. We find that the GLC consistently outperforms this method. To conserve space, results are in the Supplementary Material.

Dataset | % Trusted | Trusted Only | No Corr. | Forward | Forward Gold | Distill. | Confusion Matrix | GLC (Ours)
CIFAR-10 | 1 | 62.9 | 28.3 | 28.1 | 30.9 | 60.4 | 31.9 | 26.9
CIFAR-10 | 5 | 39.6 | 27.1 | 26.6 | 25.5 | 28.1 | 27 | 21.9
CIFAR-10 | 10 | 31.3 | 25.9 | 25.1 | 22.9 | 17.8 | 24.2 | 19.2
CIFAR-10 | Mean | 44.6 | 27.1 | 26.6 | 26.4 | 35.44 | 27.7 | 22.7
CIFAR-100 | 5 | 82.4 | 71.1 | 73.9 | 73.6 | 88.3 | 74.1 | 68.7
CIFAR-100 | 10 | 67.3 | 66 | 68.2 | 66.1 | 62.5 | 63.8 | 56.6
CIFAR-100 | 25 | 52.2 | 56.9 | 56.9 | 51.4 | 39.7 | 50.8 | 40.8
CIFAR-100 | Mean | 67.3 | 64.7 | 66.3 | 63.7 | 63.5 | 62.9 | 55.4

Table 3: Results when obtaining noisy labels by sampling from the softmax distribution of a weak classifier. Percent trusted is the trusted fraction multiplied by 100. Unless otherwise indicated, all values are the percent error. The best average result for each dataset is shown in bold.

4.4 Weak Classifier Labels

Our next benchmark for the GLC is to use noisy labels obtained from a weak classifier. This models the scenario of label noise arising from a classification system weaker than one's own, but with access to information about the true labels that one wishes to transfer to one's own system. For example, scraping image labels from surrounding text on web pages provides a valuable signal, but these labels would train a sub-par classifier without correcting the label noise. This setting exactly satisfies the conditional independence assumption on label noise used for our Ĉ estimate, because the weak classifier does not take the true label as input when outputting noisy labels.
Weak Classifier Label Generation. To obtain the labels, we train 40-layer Wide Residual Networks on CIFAR-10 and CIFAR-100 with clean labels for ten epochs each. Then, we sample from their softmax distributions with a temperature of 5, and fix the resulting labels. This results in noisy labels which we use in place of the labels obtained through the uniform, flip, and hierarchical corruption methods. The labelings produced by the weak classifiers have accuracies of 40% on CIFAR-10 and 7% on CIFAR-100. Despite the presence of highly corrupted labels, we are able to significantly recover performance with the use of a trusted set. Note that unlike the previous corruption methods, weak classifier labels have only one corruption strength. Thus, performance is measured in percent error rather than area under the error curve. Results are displayed in Table 3.
Analysis of Results. On average, the GLC outperforms all other methods in the weak classifier label experiments.
The Distillation method performs better than the GLC by a small margin at the highest trusted fraction, but performs worse at lower trusted fractions, indicating that the GLC enjoys superior data efficiency. This is highlighted by the GLC attaining a 26.94% error rate on CIFAR-10 with a trusted fraction of only 1%, down from the original error rate of 60%. It should be noted, however, that training with no correction attains 28.32% error on this experiment, suggesting that the weak classifier labels have low bias. The improvement conferred by the GLC is greater with larger trusted fractions.

5 Discussion

Data Efficiency. We have seen that the GLC works for small trusted fractions. We further corroborate its data efficiency by turning to the Clothing1M dataset [27]. Clothing1M is a massive dataset with both human-annotated and noisy labels, which we use to compare the data efficiency of the GLC to that of Distillation when very few trusted labels are present. It consists of 1 million noisily labeled clothing images obtained by crawling online marketplaces. 50,000 images have human-annotated labels, from which we take subsamples as our trusted set.
For both the GLC and Distillation, we first fine-tune a ResNet-34 on untrusted training examples for four epochs, and use this to estimate our corruption matrix. Thereafter, we fine-tune the network for four more epochs on the combined trusted and untrusted sets using the respective method. During fine-tuning, we freeze the first seven layers, and train using gradient descent with Nesterov momentum and a cosine learning rate schedule. For preprocessing, we randomly crop and use mirroring. We also upsample the trusted dataset, finding this to give better performance for both methods.
As shown in Figure 3, the GLC outperforms Distillation by a large margin, especially with fewer trusted examples.
This is because Distillation requires fine-tuning a classifier on the trusted data alone, which generalizes poorly with very few examples. By contrast, estimating the C matrix can be done with very few examples. Correspondingly, we find that our advantage decreases as the number of trusted examples increases.
With more trusted labels, performance on Clothing1M saturates, as evident in Figure 3. We consider the extreme and train on the entire trusted set for Clothing1M. We fine-tune a pre-trained 50-layer ResNeXt [28] on untrusted training examples to estimate our corruption matrix. Then, we fine-tune the ResNeXt on all training examples. During fine-tuning, we use gradient descent with Nesterov momentum. During the first two epochs, we tune only the output layer with a learning rate of 10⁻². Thereafter, we tune the whole network at a learning rate of 10⁻³ for two epochs, and for another two epochs at 10⁻⁴. Then we apply our loss correction. Now, we fine-tune the entire network at a learning rate of 10⁻³ for two epochs, continue training at 10⁻⁴, and early-stop based upon the validation set. In a previous work, Xiao et al. [27] obtain 78.24% in this setting. However, our method obtains a state-of-the-art accuracy of 80.67%, while with this procedure the Forward method only obtains 79.03% accuracy.

Figure 3: Data efficiency of our method compared to Distillation on Clothing1M.

Improving Ĉ Estimation. For some datasets, the classifier p̂(ỹ | x) may be a poor estimate of p(ỹ | x), presenting a bottleneck in the estimation of Ĉ for the GLC. To see the extent to which this could impact performance, and whether simple methods for improving p̂(ỹ | x) could help, we ran several variants of the GLC experiment on CIFAR-100 under the label flipping corruption at a trusted fraction of 5/100, which we now describe. For all variants, we averaged the area under the error curve over five random initializations.
1. In the first variant, we replaced the GLC estimate of Ĉ with C, the true corruption matrix.
2. As demonstrated by Hendrycks and Gimpel [6] and Guo et al. [5], modern deep neural network classifiers tend to have overconfident softmax distributions. We found this to be the case with our p̂(ỹ | x) estimate, despite the higher entropy of the noisy labels, so we used the temperature scaling confidence calibration method proposed in the paper to calibrate p̂(ỹ | x).
3. Suppose we know the base rates of corrupted labels b̃, where b̃ᵢ = p(ỹ = i), and the base rates of true labels b of the trusted set. If we posit that Ĉ₀ corrupted the labels, then we should have bᵀĈ₀ = b̃ᵀ. Thus, we may obtain a superior estimate of the corruption matrix by computing the new estimate Ĉ = argmin_C ‖bᵀC − b̃ᵀ‖₂² + λ‖C − Ĉ₀‖₂², subject to C1 = 1.
We found that using the true corruption matrix as our Ĉ provides a benefit of 0.96 percentage points in area under the error curve, but neither the confidence calibration nor the base rate incorporation was able to change the performance from the original GLC. This indicates that the GLC is robust to the use of uncalibrated networks for estimating C, and that improving its performance may be difficult without directly improving the performance of the neural network used to estimate p̂(ỹ | x).

6 Conclusion

In this work, we have shown the impact of having a small set of trusted examples on label noise robustness in neural network classifiers. We proposed the Gold Loss Correction (GLC), a method for coping with label noise.
This method leverages the assumption that the model has access to a small set of correct labels in order to yield accurate estimates of the noise distribution. Throughout our experiments, the GLC surpasses previous label noise robustness methods across various natural language processing and vision domains, as we showed by considering several corruptions at numerous strengths, including severe ones. These results demonstrate that the GLC is a powerful, data-efficient method for improving robustness to label noise.

Acknowledgments

We thank NVIDIA for donating GPUs used in this research.

References

[1] B. Biggio, B. Nelson, and P. Laskov. "Support Vector Machines Under Adversarial Label Noise". In: ACML (2011).

[2] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. "Learning from Untrusted Data". In: STOC (2017).

[3] Benoît Frénay and Michel Verleysen. "Classification in the presence of label noise: a survey". In: IEEE Transactions on Neural Networks and Learning Systems (2014).

[4] Kevin Gimpel et al. "Part-of-speech Tagging for Twitter: Annotation, Features, and Experiments". In: ACL (2011).

[5] Chuan Guo et al. "On Calibration of Modern Neural Networks". In: ICML (2017).

[6] Dan Hendrycks and Kevin Gimpel. "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks". In: ICLR (2017).

[7] Dan Hendrycks and Kevin Gimpel. "Gaussian Error Linear Units (GELUs)". In: arXiv 1606.08415 (2016).

[8] Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: ICLR (2014).

[9] J. Larsen et al. "Design of robust neural network classifiers". In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (1998).

[10] Bo Li et al. "Data Poisoning Attacks on Factorization-Based Collaborative Filtering". In: NIPS (2016).

[11] Yuncheng Li et al. "Learning from Noisy Labels with Distillation". In: ICCV (2017).

[12] Ilya Loshchilov and Frank Hutter. "SGDR: Stochastic Gradient Descent with Warm Restarts". In: ICLR (2016).

[13] Andrew L. Maas et al. "Learning Word Vectors for Sentiment Analysis". In: ACL (2011).

[14] Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. "Learning from Binary Labels with Instance-Dependent Corruption". In: CoRR (2016).

[15] Volodymyr Mnih and Geoffrey E. Hinton. "Learning to label aerial images from noisy data". In: ICML (2012).

[16] Nagarajan Natarajan et al. "Learning with Noisy Labels". In: NIPS (2013).

[17] David F. Nettleton, Albert Orriols-Puig, and Albert Fornells. "A study of the effect of different types of noise on the precision of supervised learning techniques". In: Artificial Intelligence Review (2010).

[18] Giorgio Patrini et al. "Making Deep Neural Networks Robust to Label Noise: a Loss Correction Approach". In: CVPR (2016).

[19] M. Pechenizkiy et al. "Class Noise and Supervised Learning in Medical Domains: The Effect of Feature Extraction". In: CBMS (2006).

[20] Scott Reed et al. "Training Deep Neural Networks on Noisy Labels with Bootstrapping". In: ICLR Workshop (2014).

[21] Mengye Ren et al. "Learning to Reweight Examples for Robust Deep Learning". In: ICML (2018).

[22] Richard Socher et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". In: EMNLP (2013).

[23] Nitish Srivastava et al. "Dropout: A simple way to prevent neural networks from overfitting". In: The Journal of Machine Learning Research (2014).

[24] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. "Certified Defenses for Data Poisoning Attacks". In: NIPS (2017).

[25] Sainbayar Sukhbaatar et al. "Training Convolutional Networks with Noisy Labels". In: ICLR Workshop (2014).

[26] Andreas Veit et al. "Learning From Noisy Large-Scale Datasets With Minimal Supervision". In: CVPR (2017).

[27] Tong Xiao et al. "Learning from massive noisy labeled data for image classification". In: CVPR (2015).

[28] Saining Xie et al. "Aggregated Residual Transformations for Deep Neural Networks". In: CVPR (2016).

[29] Sergey Zagoruyko and Nikos Komodakis. "Wide Residual Networks". In: BMVC (2016).

[30] Xingquan Zhu and Xindong Wu. "Class Noise vs. Attribute Noise: A Quantitative Study". In: Artificial Intelligence Review (2004).