{"title": "Co-teaching: Robust training of deep neural networks with extremely noisy labels", "book": "Advances in Neural Information Processing Systems", "page_first": 8527, "page_last": 8537, "abstract": "Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally memorize these noisy labels sooner or later during training. Nonetheless, recent studies on the memorization effects of deep neural networks show that they would first memorize training data of clean labels and then those of noisy labels. Therefore, in this paper we propose a new deep learning paradigm called ''Co-teaching'' for combating noisy labels. Namely, we train two deep neural networks simultaneously, and let them teach each other given every mini-batch: firstly, each network feeds forward all data and selects some data of possibly clean labels; secondly, the two networks communicate with each other about which data in this mini-batch should be used for training; finally, each network back propagates the data selected by its peer network and updates itself. Empirical results on noisy versions of MNIST, CIFAR-10 and CIFAR-100 demonstrate that Co-teaching is much superior to the state-of-the-art methods in the robustness of trained deep models.", "full_text": "Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels

Bo Han∗1,2, Quanming Yao∗3, Xingrui Yu1, Gang Niu2, Miao Xu2, Weihua Hu4, Ivor W. Tsang1, Masashi Sugiyama2,5
1Centre for Artificial Intelligence, University of Technology Sydney; 2RIKEN; 3 4Paradigm Inc.; 4Stanford University; 5University of Tokyo

Abstract

Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally memorize these noisy labels sooner or later during training.
Nonetheless, recent studies on the memorization effects of deep neural networks show that they would first memorize training data of clean labels and then those of noisy labels. Therefore, in this paper we propose a new deep learning paradigm called "Co-teaching" for combating noisy labels. Namely, we train two deep neural networks simultaneously, and let them teach each other given every mini-batch: firstly, each network feeds forward all data and selects some data of possibly clean labels; secondly, the two networks communicate with each other about which data in this mini-batch should be used for training; finally, each network back propagates the data selected by its peer network and updates itself. Empirical results on noisy versions of MNIST, CIFAR-10 and CIFAR-100 demonstrate that Co-teaching is much superior to the state-of-the-art methods in the robustness of trained deep models.

1 Introduction

Learning from noisy labels dates back three decades [1], and the topic remains vibrant in recent years [13, 31]. Essentially, noisy labels are corrupted from ground-truth labels, and thus they inevitably degrade the robustness of learned models, especially for deep neural networks [2, 45]. Unfortunately, noisy labels are ubiquitous in the real world. For instance, both online queries [4] and crowdsourcing [42, 44] yield a large number of noisy labels across the world every day.

As deep neural networks have the high capacity to fit noisy labels [45], it is challenging to train deep networks robustly with noisy labels. Current methods focus on estimating the noise transition matrix. For example, on top of the softmax layer, Goldberger et al. [13] added an additional softmax layer to model the noise transition matrix. Patrini et al. [31] leveraged a two-step solution to estimate the noise transition matrix heuristically.
However, the noise transition matrix is not easy to estimate accurately, especially when the number of classes is large.

To be free of estimating the noise transition matrix, a promising direction focuses on training on selected samples [17, 26, 34]. These works try to select clean instances out of the noisy ones, and then use them to update the network. Intuitively, as the training data becomes less noisy, better performance can be obtained. Among these works, the representative methods are MentorNet [17] and Decoupling [26]. Specifically, MentorNet pre-trains an extra network, and then uses the extra network to select clean instances to guide the training. When clean validation data is not available, MentorNet has to use a predefined curriculum (e.g., a self-paced curriculum). Nevertheless, the idea of self-paced MentorNet is similar to the self-training approach [6], and it inherits the same weakness: error accumulated through sample-selection bias. Decoupling trains two networks

∗The first two authors (Bo Han and Quanming Yao) made equal contributions. The implementation is available at https://github.com/bhanML/Co-teaching.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Comparison of error flow among MentorNet (M-Net) [17], Decoupling [26] and Co-teaching. Assume that the error flow comes from the biased selection of training instances, and the error flow from network A or B is denoted by red or blue arrows, respectively. Left panel: M-Net maintains only one network (A). Middle panel: Decoupling maintains two networks (A & B). The parameters of the two networks are updated when their predictions disagree (!=). Right panel: Co-teaching maintains two networks (A & B) simultaneously.
In each mini-batch of data, each network samples its small-loss instances as the useful knowledge, and teaches such useful instances to its peer network for further training. Thus, the error flow in Co-teaching displays a zigzag shape.

simultaneously, and then updates models only using the instances on which the two networks make different predictions. Nonetheless, noisy labels are evenly spread across the whole space of examples. Thus, the disagreement area includes a number of noisy labels, and the Decoupling approach cannot handle noisy labels explicitly. Although MentorNet and Decoupling are representative approaches in this promising direction, the issues discussed above still exist, which naturally motivates us to improve them in our research.

Meanwhile, an interesting observation for deep models is that they can memorize easy instances first, and gradually adapt to hard instances as training epochs grow [2]. When noisy labels exist, deep learning models will eventually memorize these wrongly given labels [45], which leads to poor generalization performance. Besides, this phenomenon does not change with the choice of training optimizers (e.g., Adagrad [9] and Adam [18]) or network architectures (e.g., MLP [15], AlexNet [20] and Inception [37]) [17, 45].

In this paper, we propose a simple but effective learning paradigm called "Co-teaching", which allows us to train deep networks robustly even with extremely noisy labels (e.g., 45% of noisy labels occur in fine-grained classification with multiple classes [8]). Our idea stems from the Co-training approach [5].
Similarly to Decoupling, our Co-teaching also maintains two networks simultaneously. That being said, it is worth noting that, in each mini-batch of data, each network views its small-loss instances (like self-paced MentorNet) as the useful knowledge, and teaches such useful instances to its peer network for updating the parameters. The intuition for why Co-teaching can be more robust is briefly explained as follows. In Figure 1, assume that the error flow comes from the biased selection of training instances in the first mini-batch of data. In MentorNet or Decoupling, the error from one network is directly transferred back to itself in the second mini-batch of data, and the error becomes increasingly accumulated. However, in Co-teaching, since the two networks have different learning abilities, they can filter different types of error introduced by noisy labels. In this exchange procedure, the error flows can be reduced by the peer networks mutually. Moreover, we train deep networks using stochastic optimization with momentum, and nonlinear deep networks can memorize clean data first to become robust [2]. When the error from noisy data flows into the peer network, the peer will attenuate this error due to its robustness.

We conduct experiments on noisy versions of the MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness of deep learning models trained by the Co-teaching approach is much superior to state-of-the-art baselines. Under low-level noisy circumstances (i.e., 20% of noisy labels), the robustness of deep learning models trained by the Co-teaching approach is still superior to most baselines.

2 Related literature

Statistical learning methods. Statistical learning has contributed a lot to the problem of noisy labels, especially in theoretical aspects.
The approaches can be categorized into three strands: surrogate losses, noise rate estimation, and probabilistic modeling.

Algorithm 1 Co-teaching Algorithm.
1: Input wf and wg, learning rate η, fixed τ, epochs Tk and Tmax, iterations Nmax;
for T = 1, 2, ..., Tmax do
  2: Shuffle training set D;  // noisy dataset
  for N = 1, ..., Nmax do
    3: Fetch mini-batch D̄ from D;
    4: Obtain D̄f = arg min_{D′: |D′| ≥ R(T)|D̄|} ℓ(f, D′);  // sample R(T)% small-loss instances
    5: Obtain D̄g = arg min_{D′: |D′| ≥ R(T)|D̄|} ℓ(g, D′);  // sample R(T)% small-loss instances
    6: Update wf = wf − η∇ℓ(f, D̄g);  // update wf by D̄g
    7: Update wg = wg − η∇ℓ(g, D̄f);  // update wg by D̄f
  end
  8: Update R(T) = 1 − min{(T/Tk)·τ, τ};
end
9: Output wf and wg.

For example, in the surrogate losses category, Natarajan et al. [30] proposed an unbiased estimator to provide the noise-corrected loss approach. Masnadi-Shirazi et al. [27] presented a robust non-convex loss, which is a special case in a family of robust losses. In the noise rate estimation category, both Menon et al. [28] and Liu et al. [23] proposed a class-probability estimator using order statistics on the range of scores. Sanderson et al. [36] presented the same estimator using the slope of the ROC curve. In the probabilistic modeling category, Raykar et al. [32] proposed a two-coin model to handle noisy labels from multiple annotators. Yan et al. [42] extended this two-coin model by setting a dynamic flipping probability associated with instances.

Other deep learning approaches. In addition, there are some other deep learning solutions to deal with noisy labels [24, 41]. For example, Li et al.
[22] proposed a unified framework to distill the knowledge from clean labels and a knowledge graph, which can be exploited to learn a better model from noisy labels. Veit et al. [40] trained a label cleaning network on a small set of clean labels, and used this network to reduce the noise in large-scale noisy labels. Tanaka et al. [38] presented a joint optimization framework to learn parameters and estimate true labels simultaneously. Ren et al. [34] leveraged an additional validation set to adaptively assign weights to training examples in every iteration. Rodrigues et al. [35] added a crowd layer after the output layer for noisy labels from multiple annotators. However, all these methods require either extra resources or more complex networks.

Learning to teach methods. Learning-to-teach is also a hot topic. Inspired by [16], these methods consist of teacher and student networks. The duty of the teacher network is to select more informative instances for better training of the student network. Recently, this idea has been applied to learning a proper curriculum for the training data [10] and to dealing with multi-label data [14]. However, these works do not consider noisy labels; MentorNet [17] introduced this idea into the noisy-label area.

3 Co-teaching meets noisy supervision

Our idea is to train two deep networks simultaneously. As in Figure 1, in each mini-batch of data, each network selects its small-loss instances as the useful knowledge, and teaches such useful instances to its peer network for further training. Therefore, the proposed algorithm is named Co-teaching (Algorithm 1). As all deep learning training methods are based on stochastic gradient descent, our Co-teaching works in a mini-batch manner. Specifically, we maintain two networks f (with parameter wf) and g (with parameter wg). When a mini-batch D̄ is formed (step 3), we first let f (resp. g) select a small proportion of instances in this mini-batch D̄f (resp.
D̄g) that have small training loss (steps 4 and 5). The number of instances is controlled by R(T), and f (resp. g) only selects an R(T) percentage of small-loss instances out of the mini-batch. Then, the selected instances are fed into the peer network as the useful knowledge for parameter updates (steps 6 and 7).

There are two important questions in designing Algorithm 1 above:

Q1. Why can sampling small-loss instances based on a dynamic R(T) help us find clean instances?
Q2. Why do we need two networks and cross-update the parameters?

To answer the first question, we first need to clarify the connection between small losses and clean instances. Intuitively, when labels are correct, small-loss instances are more likely to be the ones that are correctly labeled. Thus, if we train our classifier only using the small-loss instances in each mini-batch of data, it should be resistant to noisy labels.

However, the above requires that the classifier be reliable enough that the small-loss instances are indeed clean. The "memorization" effect of deep networks can exactly help us address this problem [2]. Namely, on noisy data sets, even with the existence of noisy labels, deep networks will learn clean and easy patterns in the initial epochs [45, 2]. So, they have the ability to filter out noisy instances using their loss values at the beginning of training. Yet, the problem is that when the number of epochs grows large, they will eventually overfit on noisy labels. To rectify this problem, we want to keep more instances in the mini-batch at the start, i.e., R(T) is large.
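Before turning to how R(T) decays, the selection-and-exchange step described above can be made concrete with a minimal, illustrative sketch of one Co-teaching mini-batch (this is not the authors' released code; `losses_f` and `losses_g` stand for the per-example losses of the two networks on the current batch, and the actual gradient updates are left to the optimizer):

```python
import numpy as np

def coteaching_select(losses_f, losses_g, remember_rate):
    """One mini-batch of Co-teaching's sample exchange: each network ranks
    the batch by its own per-example loss, keeps the R(T) fraction with the
    smallest losses, and hands those indices to its peer, which then
    back-propagates only on the peer-selected instances."""
    num_keep = int(remember_rate * len(losses_f))
    small_loss_f = np.argsort(losses_f)[:num_keep]  # f's "possibly clean" picks
    small_loss_g = np.argsort(losses_g)[:num_keep]  # g's "possibly clean" picks
    # Cross-update: f is trained on g's picks, g is trained on f's picks.
    return small_loss_g, small_loss_f  # (indices for updating f, for updating g)
```

In a real training loop, the two loss vectors would be the un-reduced cross-entropy losses of networks f and g, and each returned index set would define the subset of the mini-batch whose gradients update wf and wg, respectively.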
Then, we gradually increase the drop rate, i.e., R(T) becomes smaller, so that we can keep clean instances and drop the noisy ones before our networks memorize them (details of R(T) are discussed in Section 4.2).

Based on this idea, we could just use one network in Algorithm 1 and let the classifier evolve by itself. This process is similar to boosting [11] and active learning [7]. However, it is commonly known that boosting and active learning are sensitive to outliers and noise, and a few wrongly selected instances can deteriorate the learning performance of the whole model [12, 3]. This connects with our second question, where two classifiers can help.

Intuitively, different classifiers can generate different decision boundaries and thus have different abilities to learn. Thus, when training on noisy labels, we also expect them to have different abilities to filter out the label noise. This motivates us to exchange the selected small-loss instances, i.e., update the parameters of f (resp. g) using the mini-batch instances selected by g (resp. f). This process is similar to Co-training [5], and the two networks will adaptively correct the training error of their peer if the selected instances are not fully clean. Take "peer review" as a supportive example. When students check their own exam papers, it is hard for them to find any errors because they have some personal bias about the answers. Luckily, they can ask peer classmates to review their papers, and then it becomes much easier for them to find their potential faults. To sum up, as the error from one network is not directly transferred back to itself, we can expect that our Co-teaching method can deal with heavier noise than the self-evolving alternative.

Relations to Co-training. Although Co-teaching is motivated by Co-training, the only similarity is that two classifiers are trained.
There are fundamental differences between them. (i) Co-training needs two views (two independent sets of features), while Co-teaching needs only a single view. (ii) Co-training does not exploit the memorization of deep neural networks, while Co-teaching does. (iii) Co-training is designed for semi-supervised learning (SSL), and Co-teaching is for learning with noisy labels (LNL); as LNL is not a special case of SSL, we cannot simply translate Co-training from one problem setting to the other.

4 Experiments

Datasets. We verify the effectiveness of our approach on three benchmark datasets: MNIST, CIFAR-10 and CIFAR-100 (Table 1), because these datasets are popularly used for the evaluation of noisy labels in the literature [13, 31, 33].

Table 1: Summary of data sets used in the experiments.

            # of training   # of testing   # of class   image size
MNIST          60,000          10,000          10          28×28
CIFAR-10       50,000          10,000          10          32×32
CIFAR-100      50,000          10,000         100          32×32

Since all datasets are clean, following [31, 33], we need to corrupt them manually by a noise transition matrix Q, where Qij = Pr(ỹ = j | y = i) given that the noisy label ỹ is flipped from the clean label y. We assume the matrix Q has one of two representative structures (Figure 2): (1) Symmetry flipping [39]; (2) Pair flipping, a simulation of fine-grained classification with noisy labels, where labelers may make mistakes only within very similar classes. Their precise definitions are in Appendix A.

Since this paper mainly focuses on the robustness of our Co-teaching under extremely noisy supervision, the noise rate ε is chosen from {0.45, 0.5}. Intuitively, this means almost half of the instances have noisy labels.
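As an illustration of these two noise structures (a sketch based on the definitions above and Figure 2, not the paper's Appendix A), the transition matrices can be built as follows:

```python
import numpy as np

def pair_flip_matrix(num_classes, noise_rate):
    """Pair flipping: each class keeps 1 - noise_rate of its mass and
    flips noise_rate of it to one (the "next") similar class."""
    Q = np.eye(num_classes) * (1.0 - noise_rate)
    for i in range(num_classes):
        Q[i, (i + 1) % num_classes] = noise_rate
    return Q

def symmetry_flip_matrix(num_classes, noise_rate):
    """Symmetry flipping: noise_rate of the mass is spread uniformly
    over all the other classes."""
    Q = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(Q, 1.0 - noise_rate)
    return Q

def corrupt_labels(clean_labels, Q, rng):
    """Flip each clean label y to label j with probability Q[y, j]."""
    return np.array([rng.choice(Q.shape[1], p=Q[y]) for y in clean_labels])
```

For 5 classes, `pair_flip_matrix(5, 0.45)` reproduces Figure 2(a) (0.55 on the diagonal, 0.45 on the neighboring class), and `symmetry_flip_matrix(5, 0.5)` reproduces Figure 2(b) (0.5 on the diagonal, 0.125 elsewhere).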
Note that a noise rate above 50% for pair flipping means that over half of the training data have wrong labels, which cannot be learned without additional assumptions. As a side product, we also verify the robustness of Co-teaching under low-level noisy supervision, where ε is set to 0.2. Note that the pair case is much harder than the symmetry case: in Figure 2(a), the true class has only 10% more correct instances than the wrong one, whereas in Figure 2(b) the true class has 37.5% more correct instances.

Figure 2: Transition matrices of different noise types (using 5 classes as an example). (a) Pair (ε = 45%). (b) Symmetry (ε = 50%).

Table 2: Comparison of state-of-the-art techniques with our Co-teaching approach. In the first column, "large class": can deal with a large number of classes; "heavy noise": can combat heavy noise, i.e., a high noise ratio; "flexibility": need not be combined with a specific network architecture; "no pre-train": can be trained from scratch.

                Bootstrap   S-model   F-correction   Decoupling   MentorNet   Co-teaching
large class         ✗          ✗           ✗             ✓            ✓            ✓
heavy noise         ✗          ✗           ✗             ✗            ✓            ✓
flexibility         ✗          ✗           ✓             ✓            ✓            ✓
no pre-train        ✓          ✗           ✗             ✗            ✓            ✓

Baselines. We compare Co-teaching (Algorithm 1) with the following state-of-the-art approaches: (i) Bootstrap [33], which uses a weighted combination of predicted and original labels as the correct labels, and then does back propagation (hard labels are used, as they yield better performance); (ii) S-model [13], which uses an additional softmax layer to model the noise transition matrix; (iii) F-correction [31], which corrects the prediction by the noise transition matrix.
As suggested by the authors, we first train a standard network to estimate the transition matrix; (iv) Decoupling [26], which updates the parameters only using the samples on which the two classifiers make different predictions; and (v) MentorNet [17], in which an extra teacher network is pre-trained and then used to filter out noisy instances so that its student network learns robustly under noisy labels; the student network is then used for classification. We use self-paced MentorNet in this paper. (vi) As a baseline, we compare Co-teaching with standard deep networks trained directly on the noisy datasets (abbreviated as Standard). The above methods are systematically compared in Table 2. As can be seen, our Co-teaching method does not rely on any specific network architecture, can deal with a large number of classes, and is more robust to noise. Besides, it can be trained from scratch. These properties make Co-teaching more appealing for practical usage. Our implementation of Co-teaching is available at https://github.com/bhanML/Co-teaching.

Network structure and optimizer. For a fair comparison, we implement all methods with default parameters in PyTorch, and conduct all experiments on an NVIDIA K80 GPU. A CNN with the Leaky-ReLU (LReLU) activation function [25] is used, and the detailed architecture is in Table 3. Namely, the 9-layer CNN architecture in our paper follows "Temporal Ensembling" [21] and "Virtual Adversarial Training" [29], since this network structure is a standard test bed for weakly-supervised learning. For all experiments, the Adam optimizer (momentum = 0.9) is used with an initial learning rate of 0.001, the batch size is set to 128, and we run 200 epochs. Besides, dropout and batch normalization are also used. As deep networks are highly nonconvex, even with the same network and optimization method, different initializations can lead to different local optima.
Thus, following [26], we also take two networks with the same architecture but different initializations as the two classifiers.

Experimental setup. Here, we assume the noise level ε is known and set R(T) = 1 − τ · min(T/Tk, 1) with Tk = 10 and τ = ε. If ε is not known in advance, it can be inferred using validation sets [23, 43]. The choices of R(T) and τ are analyzed in Section 4.2. Note that R(T) depends only on the memorization effect of deep networks, not on any specific dataset.

As for performance measurements, first, we use the test accuracy, i.e., test accuracy = (# of correct predictions) / (# of test instances). Besides, we also use the label precision in each mini-batch, i.e., label precision = (# of clean labels) / (# of all selected labels). Specifically, we sample the R(T)% small-loss instances in each mini-batch, and then calculate the ratio of clean labels among them. Intuitively, higher label precision means fewer noisy instances in the mini-batch after sample selection, and an algorithm with higher label precision is more robust to label noise. All experiments are repeated five times. The error bar for the standard deviation in each figure is highlighted as a shade. Besides, the full Y-axis versions of all figures are in Appendix B.

Table 3: CNN models used in our experiments on MNIST, CIFAR-10, and CIFAR-100. The slopes of all LReLU functions in the networks are set to 0.01.

CNN on MNIST          CNN on CIFAR-10       CNN on CIFAR-100
28×28 Gray Image      32×32 RGB Image       32×32 RGB Image
                3×3 conv, 128 LReLU
                3×3 conv, 128 LReLU
                3×3 conv, 128 LReLU
                2×2 max-pool, stride 2
                dropout, p = 0.25
                3×3 conv, 256 LReLU
                3×3 conv, 256 LReLU
                3×3 conv, 256 LReLU
                2×2 max-pool, stride 2
                dropout, p = 0.25
                3×3 conv, 512 LReLU
                3×3 conv, 256 LReLU
                3×3 conv, 128 LReLU
                avg-pool
dense 128→10          dense 128→10          dense 128→100

Table 4: Average test accuracy on MNIST over the last ten epochs.

Flipping-Rate    Standard       Bootstrap      S-model        F-correction   Decoupling     MentorNet      Co-teaching
Pair-45%         56.52±0.55%    57.23±0.73%    56.88±0.32%     0.24±0.03%    58.03±0.07%    80.88±4.45%    87.63±0.21%
Symmetry-50%     66.05±0.61%    67.55±0.53%    62.29±0.46%    79.61±1.96%    81.15±0.03%    90.05±0.30%    91.32±0.06%
Symmetry-20%     94.05±0.16%    94.40±0.26%    98.31±0.11%    98.80±0.12%    95.70±0.02%    96.70±0.22%    97.25±0.03%

4.1 Comparison with the State-of-the-Arts

Results on MNIST. Table 4 reports the accuracy on the test set. As can be seen, in the symmetry case with a 20% noise rate, which is also the easiest case, all methods work well; even Standard achieves 94.05% test accuracy. When the noise rate rises to 50%, Standard, Bootstrap, S-model and F-correction fail, with their accuracies dropping below 80%. Methods based on selected instances, i.e., Decoupling, MentorNet and Co-teaching, are better, and among them Co-teaching is the best. Finally, in the hardest case, i.e., the pair case with a 45% noise rate, Standard, Bootstrap and S-model cannot learn anything; their test accuracies stay at the percentage of clean instances in the training dataset. F-correction fails totally, as it relies heavily on a correct estimate of the underlying transition matrix: when Standard works, F-correction can work better than Standard; when Standard fails, F-correction works much worse than Standard. In this case, our Co-teaching is again the best, by a large margin over the second-best method: 87.63% for Co-teaching vs. 80.88% for MentorNet.

In Figure 3, we show test accuracy vs.
number of epochs. In all three plots, we can clearly see the memorization effect of networks, i.e., the test accuracy of Standard first reaches a very high level and then gradually decreases. Thus, a good robust training method should stop or alleviate this decreasing process. On this point, all methods except Bootstrap work well in the easiest Symmetry-20% case. However, only MentorNet and our Co-teaching can combat the other two harder cases, i.e., Pair-45% and Symmetry-50%. Besides, our Co-teaching consistently achieves higher accuracy than MentorNet, and is the best method in these two cases.

To explain this good performance, we plot label precision vs. number of epochs in Figure 4. Only MentorNet, Decoupling and Co-teaching are considered here, as they are the methods that perform instance selection during training. First, we can see that Decoupling fails to pick up clean instances: its label precision is the same as that of Standard, which does not combat noisy labels at all. The reason is that Decoupling does not utilize the memorization effect during training. Then, we can see that Co-teaching and MentorNet successfully pick out clean instances. These two methods tie on the easier Symmetry-50% and Symmetry-20% cases, while our Co-teaching achieves higher precision in the hardest Pair-45% case. This shows that our approach is better at finding clean instances.

Figure 3: Test accuracy vs. number of epochs on MNIST dataset. (a) Pair-45%. (b) Symmetry-50%. (c) Symmetry-20%.

Figure 4: Label precision vs. number of epochs on MNIST dataset. (a) Pair-45%. (b) Symmetry-50%. (c) Symmetry-20%.

Finally, note that although MentorNet and Co-teaching tie in Figures 4(b) and (c), Co-teaching still gets higher test accuracy (Table 4). Recall that MentorNet is a self-evolving method which uses only one classifier, while Co-teaching uses two.
The better accuracy comes from the fact that Co-teaching further takes advantage of the different learning abilities of the two classifiers.

Results on CIFAR-10. Test accuracy is shown in Table 5. As we can see, the observations here are consistent with those for the MNIST dataset. In the easiest Symmetry-20% case, all methods work well; F-correction is the best, and our Co-teaching is comparable with F-correction. All methods except MentorNet and Co-teaching fail on the harder Pair-45% and Symmetry-50% cases. Between these two, Co-teaching is the best. In the extreme Pair-45% case, Co-teaching is at least 14% higher than MentorNet in test accuracy.

Table 5: Average test accuracy on CIFAR-10 over the last ten epochs.

Flipping-Rate    Standard       Bootstrap      S-model        F-correction   Decoupling     MentorNet      Co-teaching
Pair-45%         49.50±0.42%    50.05±0.30%    48.21±0.55%     6.61±1.12%    48.80±0.04%    58.14±0.38%    72.62±0.15%
Symmetry-50%     48.87±0.52%    50.66±0.56%    46.15±0.76%    59.83±0.17%    51.49±0.08%    71.10±0.48%    74.02±0.04%
Symmetry-20%     76.25±0.28%    77.01±0.29%    76.84±0.66%    84.55±0.16%    80.44±0.05%    80.76±0.36%    82.32±0.07%

Figure 5 shows test accuracy and label precision vs. number of epochs. Again, on test accuracy, we can see that Co-teaching strongly hinders neural networks from memorizing noisy labels. Thus, it works much better on the harder Pair-45% and Symmetry-50% cases. On label precision, while Decoupling fails to find clean instances, both MentorNet and Co-teaching can do so. However, due to the use of two classifiers, Co-teaching is stronger.

Results on CIFAR-100. Finally, we show our results on CIFAR-100. The test accuracy is in Table 6. Test accuracy and label precision vs. number of epochs are in Figure 6.
Note that there are only 10 classes in the MNIST and CIFAR-10 datasets, whereas CIFAR-100 has 100, so the overall accuracy here is much lower than the previous ones in Tables 4 and 5. However, the observations are the same as for the previous datasets: we can clearly see that our Co-teaching is the best in the harder and noisier cases.

Figure 5: Results on CIFAR-10 dataset. Top: test accuracy vs. number of epochs; bottom: label precision vs. number of epochs. (a) Pair-45%. (b) Symmetry-50%. (c) Symmetry-20%.

Table 6: Average test accuracy on CIFAR-100 over the last ten epochs.

Flipping-Rate    Standard       Bootstrap      S-model        F-correction   Decoupling     MentorNet      Co-teaching
Pair-45%         31.99±0.64%    32.07±0.30%    21.79±0.86%     1.60±0.04%    26.05±0.03%    31.60±0.51%    34.81±0.07%
Symmetry-50%     25.21±0.64%    21.98±6.36%    18.93±0.39%    41.04±0.07%    25.80±0.04%    39.00±1.00%    41.37±0.08%
Symmetry-20%     47.55±0.47%    47.00±0.54%    41.51±0.60%    54.23±0.08%    44.52±0.04%    52.13±0.40%    61.87±0.21%

4.2 Choices of R(T) and τ

Deep networks initially fit clean (easy) instances, and then fit noisy (hard) instances progressively. Thus, intuitively, R(T) should meet the following requirements: (i). R(T) ∈ [1 − τ, 1], where τ depends on the noise rate ε; (ii).
R(1) = 1, which means we do not need to drop any instances at the beginning. At the initial learning epochs, we can safely update the parameters of the deep neural networks using the entire noisy data, because the networks will not memorize the noisy data at this early stage [2]; (iii) R(T) should be a non-increasing function of T, which means that we need to drop more instances as the number of epochs gets large. This is because, as learning proceeds, the networks will eventually try to fit the noisy data (which tend to have larger losses than clean data); thus, we need to ignore such data by not updating the network parameters using large-loss instances [2]. The MNIST dataset is used in the sequel.
Based on the above principles, to show how the decay of R(T) affects Co-teaching, we first let R(T) = 1 − τ · min{T^c / T_k, 1} with τ = ε, where three choices of c are considered, i.e., c ∈ {0.5, 1, 2}. Then, three values of T_k are considered, i.e., T_k ∈ {5, 10, 15}. Results are in Table 7. As can be seen, the test accuracy is stable over these choices of T_k and c. The previous setup (c = 1 and T_k = 10) works well but does not lead to the best performance. To show the impact of τ, we vary τ ∈ {0.5, 0.75, 1, 1.25, 1.5}ε. Note that R(T) cannot reach zero: in that case, no gradient would be back-propagated and the optimization would stop. Test accuracy is in Table 8. We can see that, by dropping more instances, the performance can be improved. However, if too many instances are dropped, the networks may not get sufficient training data and the performance can deteriorate.
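As a concrete illustration, the schedule R(T) = 1 − τ · min{T^c / T_k, 1} and the small-loss exchange between the two networks can be sketched as below. This is only a minimal sketch of the idea, not the authors' released implementation: function names are ours, and per-instance losses for the mini-batch are assumed to be precomputed.

```python
import numpy as np

def keep_rate(epoch, noise_rate, t_k=10, c=1):
    """R(T) = 1 - tau * min{T^c / T_k, 1}, with tau = noise_rate (Section 4.1).

    The fraction of small-loss instances kept at epoch T: it starts at 1
    (keep all data early, before the networks memorize noise) and decays
    to 1 - tau once T reaches T_k.
    """
    return 1.0 - noise_rate * min(epoch ** c / t_k, 1.0)

def small_loss_exchange(losses_f, losses_g, r_t):
    """Each network ranks the mini-batch by its own per-instance loss,
    keeps the R(T) fraction with the smallest losses (treated as likely
    clean), and hands those indices to its peer for the parameter update.
    """
    n_keep = max(1, int(round(r_t * len(losses_f))))
    idx_for_g = np.argsort(losses_f)[:n_keep]  # f's small-loss picks update g
    idx_for_f = np.argsort(losses_g)[:n_keep]  # g's small-loss picks update f
    return idx_for_f, idx_for_g
```

With noise_rate = 0.45, `keep_rate` decays linearly from 1.0 at epoch 0 down to 0.55 at epoch T_k and stays there; the returned index sets are then used to back-propagate each network on the data selected by its peer.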
We set τ = ε in Section 4.1; this works well but does not necessarily lead to the best performance.

[Figure: test accuracy (top) and label precision (bottom) vs. number of epochs on CIFAR-10 under Pair-45%, Symmetry-50%, and Symmetry-20%.]

Figure 6: Results on CIFAR-100 dataset. (a) Pair-45%; (b) Symmetry-50%; (c) Symmetry-20%. Top: test accuracy vs. number of epochs; bottom: label precision vs. number of epochs.

Table 7: Average test accuracy on MNIST over the last ten epochs.

Flipping Rate | T_k | c = 0.5       | c = 1         | c = 2
Pair-45%      | 5   | 87.54%±0.23% | 87.59%±0.26% | 75.56%±0.33%
Pair-45%      | 10  | 88.43%±0.25% | 87.56%±0.12% | 87.93%±0.21%
Pair-45%      | 15  | 88.37%±0.09% | 87.29%±0.15% | 88.09%±0.17%
Symmetry-50%  | 5   | 91.75%±0.12% | 92.20%±0.14% | 91.75%±0.13%
Symmetry-50%  | 10  | 91.70%±0.21% | 91.27%±0.13% | 91.55%±0.08%
Symmetry-50%  | 15  | 91.74%±0.14% | 91.20%±0.11% | 91.38%±0.08%
Symmetry-20%  | 5   | 97.10%±0.06% | 97.05%±0.06% | 97.41%±0.08%
Symmetry-20%  | 10  | 97.33%±0.05% | 96.97%±0.07% | 97.48%±0.08%
Symmetry-20%  | 15  | 97.41%±0.06% | 97.25%±0.09% | 97.51%±0.05%

Table 8: Average test accuracy of Co-teaching with different τ on MNIST over the last ten epochs.

Flipping Rate | 0.5ε          | 0.75ε         | ε             | 1.25ε         | 1.5ε
Pair-45%      | 66.74%±0.28% | 77.86%±0.47% | 87.63%±0.21% | 97.89%±0.06% | 69.47%±0.02%
Symmetry-50%  | 75.89%±0.21% | 82.00%±0.28% | 91.32%±0.06% | 98.62%±0.05% | 79.43%±0.02%
Symmetry-20%  | 94.94%±0.09% | 96.25%±0.06% | 97.25%±0.03% | 98.90%±0.03% | 99.39%±0.02%

5 Conclusion
This paper presents a simple but effective learning paradigm called Co-teaching, which trains deep neural networks robustly under noisy supervision. Our key idea is to maintain two networks simultaneously and cross-train them on instances screened by the "small loss" criterion. We conduct simulated experiments to demonstrate that our proposed Co-teaching can train deep models robustly under extremely noisy supervision. In the future, we can extend our work in the following aspects. First, we can adapt the Co-teaching paradigm to train deep models under other forms of weak supervision, e.g., positive and unlabeled data [19]. Second, we would like to investigate theoretical guarantees for Co-teaching. Previous theories for Co-training are very hard to transfer to Co-teaching, since our setting is fundamentally different. Besides, there is no analysis of generalization performance for deep learning with noisy labels; thus, we leave the generalization analysis as future work.
Acknowledgments.
MS was supported by JST CREST JPMJCR1403. IWT was supported by ARC FT130100746, DP180100106 and LP150100671. BH would like to thank the financial support from RIKEN-AIP. XRY was supported by NSFC Project No. 61671481. QY would give special thanks to Weiwei Tu and Yuqiang Chen from 4Paradigm Inc.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References
[1] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
[2] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, A. Fischer, A. Courville, and Y. Bengio. A closer look at memorization in deep networks. In ICML, 2017.
[3] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
[4] A. Blum, A. Kalai, and H. Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM, 50(4):506–519, 2003.
[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
[6] O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
[7] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
[8] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013.
[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization.
Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[10] Y. Fan, F. Tian, T. Qin, J. Bian, and T. Liu. Learning to teach. In ICLR, 2018.
[11] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European COLT, 1995.
[12] Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.
[13] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
[14] C. Gong, D. Tao, J. Yang, and W. Liu. Teaching-to-learn and learning-to-teach for multi-label propagation. In AAAI, 2016.
[15] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[16] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[17] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
[18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[19] R. Kiryo, G. Niu, M. Du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NIPS, 2017.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[21] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
[22] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and J. Li. Learning from noisy labels with distillation. In ICCV, 2017.
[23] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2016.
[24] X. Ma, Y. Wang, M. Houle, S. Zhou, S. Erfani, S. Xia, S. Wijewickrema, and J. Bailey. Dimensionality-driven learning with noisy labels. In ICML, 2018.
[25] A. Maas, A. Hannun, and A. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
[26] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In NIPS, 2017.
[27] H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In NIPS, 2009.
[28] A. Menon, B. Van Rooyen, C. Ong, and B. Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, 2015.
[29] T. Miyato, A. Dai, and I. Goodfellow. Virtual adversarial training for semi-supervised text classification. In ICLR, 2016.
[30] N. Natarajan, I. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS, 2013.
[31] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
[32] V. Raykar, S. Yu, L. Zhao, G. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
[33] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR, 2015.
[34] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.
[35] F. Rodrigues and F. Pereira. Deep learning from crowds. In AAAI, 2018.
[36] T. Sanderson and C. Scott. Class proportion estimation with application to multiclass anomaly rejection. In AISTATS, 2014.
[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[38] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.
[39] B. Van Rooyen, A. Menon, and B. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS, 2015.
[40] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, 2017.
[41] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia. Iterative learning with open-set noisy labels. In CVPR, 2018.
[42] Y. Yan, R. Rosales, G. Fung, R. Subramanian, and J. Dy. Learning from multiple annotators with varying expertise. Machine Learning, 95(3):291–327, 2014.
[43] X. Yu, T. Liu, M. Gong, K. Batmanghelich, and D. Tao. An efficient and provable approach for mixture proportion estimation using linear independence assumption. In CVPR, 2018.
[44] X. Yu, T. Liu, M. Gong, and D. Tao. Learning with biased complementary labels. In ECCV, 2018.
[45] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.