{"title": "Are Anchor Points Really Indispensable in Label-Noise Learning?", "book": "Advances in Neural Information Processing Systems", "page_first": 6838, "page_last": 6849, "abstract": "In label-noise learning, the \\textit{noise transition matrix}, denoting the probabilities that clean labels flip into noisy labels, plays a central role in building \\textit{statistically consistent classifiers}. Existing theories have shown that the transition matrix can be learned by exploiting \\textit{anchor points} (i.e., data points that belong to a specific class almost surely). However, when there are no anchor points, the transition matrix will be poorly learned, and those previously consistent classifiers will significantly degenerate. In this paper, without employing anchor points, we propose a \\textit{transition-revision} ($T$-Revision) method to effectively learn transition matrices, leading to better classifiers. Specifically, to learn a transition matrix, we first initialize it by exploiting data points that are similar to anchor points, having high \\textit{noisy class posterior probabilities}. Then, we modify the initialized matrix by adding a \\textit{slack variable}, which can be learned and validated together with the classifier by using noisy data. 
Empirical results on benchmark-simulated and real-world label-noise datasets demonstrate that without using exact anchor points, the proposed method is superior to state-of-the-art label-noise learning methods.", "full_text": "Are Anchor Points Really Indispensable in Label-Noise Learning?

Xiaobo Xia1,2  Tongliang Liu1  Nannan Wang2  Bo Han3  Chen Gong4  Gang Niu3  Masashi Sugiyama3,5
1University of Sydney  2Xidian University  3RIKEN  4Nanjing University of Science and Technology  5University of Tokyo

Abstract

In label-noise learning, the noise transition matrix, denoting the probabilities that clean labels flip into noisy labels, plays a central role in building statistically consistent classifiers. Existing theories have shown that the transition matrix can be learned by exploiting anchor points (i.e., data points that belong to a specific class almost surely). However, when there are no anchor points, the transition matrix will be poorly learned, and those previously consistent classifiers will significantly degenerate. In this paper, without employing anchor points, we propose a transition-revision (T-Revision) method to effectively learn transition matrices, leading to better classifiers. Specifically, to learn a transition matrix, we first initialize it by exploiting data points that are similar to anchor points, having high noisy class posterior probabilities. Then, we modify the initialized matrix by adding a slack variable, which can be learned and validated together with the classifier by using noisy data. 
Empirical results on benchmark-simulated and real-world label-noise datasets demonstrate that without using exact anchor points, the proposed method is superior to state-of-the-art label-noise learning methods.

1 Introduction

Label-noise learning dates back to [1] but has recently become an increasingly important topic. The reason is that datasets are becoming bigger and bigger, and large-scale datasets are infeasible to annotate accurately at reasonable cost, which naturally yields cheap datasets with noisy labels.
Existing methods for label-noise learning can be broadly divided into two categories: algorithms that result in statistically inconsistent classifiers and algorithms that result in statistically consistent classifiers. Methods in the first category usually employ heuristics to reduce the side-effect of noisy labels. For example, many state-of-the-art approaches in this category are specifically designed to select reliable examples [45, 14, 24], reweight examples [33, 15], correct labels [23, 17, 37, 32], employ side information [39, 21], or (implicitly) add regularization [13, 12, 43, 39, 21]. All those methods were reported to work empirically very well. However, the differences between the learned classifiers and the optimal ones for clean data are not guaranteed to vanish, i.e., no statistical consistency has been guaranteed.
The above issue motivates researchers to explore algorithms in the second category: risk-consistent and classifier-consistent algorithms. In general, risk-consistent methods possess statistically consistent estimators of the clean risk (i.e., the risk w.r.t. the clean data), while classifier-consistent methods guarantee that the classifier learned from the noisy data converges to the optimal classifier (i.e., the minimizer of the clean risk) [42]. 
Methods in this category utilize the noise transition matrix, denoting the probabilities that clean labels flip into noisy labels, to build consistent algorithms. Let Y denote the variable for the clean label, Ȳ the noisy label, and X the instance/feature. The basic idea is that given the noisy class posterior probability P(Ȳ|X = x) = [P(Ȳ = 1|X = x), . . . , P(Ȳ = C|X = x)]^⊤ (which can be learned using noisy data) and the transition matrix T(X = x), where T_ij(X = x) = P(Ȳ = j|Y = i, X = x), the clean class posterior probability P(Y|X = x) can be inferred, i.e., P(Y|X = x) = (T(X = x)^⊤)^{-1} P(Ȳ|X = x). For example, loss functions are modified to ensure risk consistency, e.g., [49, 17, 22, 29, 35, 26]; a noise adaptation layer is added to deep neural networks to design classifier-consistent deep learning algorithms [9, 30, 38, 47]. Those algorithms are strongly theoretically grounded but heavily rely on the success of learning transition matrices.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Given risk-consistent estimators, one stream to learn the transition matrix is the cross-validation method (using only noisy data) for binary classification [26]. However, it is prohibitive for multi-class problems, as its computational complexity grows exponentially with the number of classes. Besides, the current risk-consistent estimators involve the inverse of the transition matrix, making tuning the transition matrix inefficient and also leading to performance degeneration [30], especially when the transition matrix is non-invertible. Independent of risk-consistent estimators, another stream to learn the transition matrix is closely related to mixture proportion estimation [40]. A series of assumptions [36, 22, 35, 31] were proposed to efficiently learn transition matrices (or mixture parameters) by only exploiting the noisy data. 
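The bridge identity above can be made concrete with a small numeric sketch (a hypothetical 3-class transition matrix, not taken from the paper): the noisy posterior is T^⊤ times the clean posterior, and with the true T the clean posterior is recovered exactly by inversion.

```python
import numpy as np

# Hypothetical class-dependent transition matrix: T[i, j] = P(Ybar = j | Y = i).
# Rows sum to one; the numbers are illustrative only.
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

p_clean = np.array([0.2, 0.5, 0.3])   # P(Y | X = x), the clean class posterior
p_noisy = T.T @ p_clean               # P(Ybar | X = x) = T^T P(Y | X = x) -> [0.22, 0.47, 0.31]

# With the true T, the clean posterior is recovered exactly by inversion.
p_recovered = np.linalg.solve(T.T, p_noisy)
```

The whole difficulty discussed in this paper is that T is unknown and must itself be estimated from noisy data.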
All those assumptions require anchor points, i.e., instances belonging to a specific class with probability exactly one or close to one. Nonetheless, without anchor points, the transition matrix could be poorly learned, which will degenerate the accuracies of existing consistent algorithms.
Therefore, in this paper, to handle applications where the anchor-point assumptions are violated [46, 41], we propose a transition-revision (T-Revision) method to effectively learn transition matrices, leading to better classifiers. At a high level, we design a deep-learning-based risk-consistent estimator to tune the transition matrix accurately. Specifically, we first initialize the transition matrix by exploiting examples that are similar to anchor points, namely, those having high estimated noisy class posterior probabilities. Then, we modify the initial matrix by adding a slack variable, which will be learned and validated together with the classifier by using noisy data only. Note that given the true transition matrix, the proposed estimator converges to the classification risk w.r.t. clean data as the number of noisy training examples increases. Our heuristic for tuning the transition matrix is that a favorable transition matrix should make the classification risk w.r.t. clean data small. We empirically show that the proposed T-Revision method makes the tuned transition matrices closer to the ground truths, which explains why T-Revision is much superior to state-of-the-art algorithms in classification.
The rest of the paper is organized as follows. In Section 2, we review label-noise learning with anchor points. In Section 3, we discuss how to learn the transition matrix and the classifier without anchor points. Experimental results are provided in Section 4. 
Finally, we conclude the paper in Section 5.

2 Label-Noise Learning with Anchor Points

In this section, we briefly review label-noise learning when there are anchor points.
Preliminaries  Let D be the distribution of a pair of random variables (X, Y) ∈ X × {1, 2, . . . , C}, where the feature space X ⊆ R^d and C is the number of label classes. Our goal is to predict a label y for any given instance x ∈ X. However, in many real-world classification problems, training examples drawn independently from the distribution D are unavailable. Before being observed, their true labels are independently flipped, and what we can obtain is a noisy training sample {(X_i, Ȳ_i)}_{i=1}^n, where Ȳ denotes the noisy label. Let D̄ be the distribution of the noisy random variables (X, Ȳ) ∈ X × {1, 2, . . . , C}.
Transition matrix  The random variables Ȳ and Y are related through a noise transition matrix T ∈ [0, 1]^{C×C} [8]. Generally, the transition matrix depends on instances, i.e., T_ij(X = x) = P(Ȳ = j|Y = i, X = x). Given only noisy examples, the instance-dependent transition matrix is non-identifiable without any additional assumption. For example, P(Ȳ = j|X = x) = Σ_{i=1}^C T_ij(X = x) P(Y = i|X = x) = Σ_{i=1}^C T'_ij(X = x) P'(Y = i|X = x) are both valid decompositions when T'_ij(X = x) = T_ij(X = x) P(Y = i|X = x)/P'(Y = i|X = x). In this paper, we study the class-dependent and instance-independent transition matrix, i.e., P(Ȳ = j|Y = i, X = x) = P(Ȳ = j|Y = i), which is identifiable under mild conditions and on which the vast majority of current methods focus [14, 13, 30, 29, 26].
Consistent algorithms  The transition matrix bridges the class posterior probabilities for noisy and clean data, i.e., P(Ȳ = j|X = x) = Σ_{i=1}^C T_ij P(Y = i|X = x). Thus, it has been exploited to build consistent algorithms. 
Specifically, it has been used to modify loss functions to build risk-consistent estimators, e.g., [26, 35, 30], and to correct hypotheses to build classifier-consistent algorithms, e.g., [9, 30, 47]. Note that an estimator is risk-consistent if, by increasing the size of noisy examples, the empirical risk calculated with noisy examples and the modified loss function converges to the expected risk calculated with clean examples and the original loss function. Similarly, an algorithm is classifier-consistent if, by increasing the size of noisy examples, the learned classifier converges to the optimal classifier learned from clean examples. Definitions of the expected and empirical risks can be found in Appendix B, where we further discuss how consistent algorithms work.
Anchor points  The successes of consistent algorithms rely on firm bridges, i.e., accurately learned transition matrices. To learn transition matrices, the concept of anchor point was proposed [22, 35]. Anchor points are defined in the clean data domain, i.e., an instance x is an anchor point for the class i if P(Y = i|X = x) is equal to one or close to one¹. Given an x, if P(Y = i|X = x) = 1, we have that for k ≠ i, P(Y = k|X = x) = 0. Then, we have

P(Ȳ = j|X = x) = Σ_{k=1}^C T_kj P(Y = k|X = x) = T_ij.    (1)

Namely, T can be obtained via estimating the noisy class posterior probabilities for anchor points [47]. However, the requirement of given anchor points is a bit strong. Thus, anchor points are assumed to exist but be unknown in datasets, and they can be identified either theoretically [22] or heuristically [30].
Transition matrix learning is also closely related to mixture proportion estimation [40], which is independent of classification. 
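A minimal sketch of the anchor-point recipe in Eq. (1), with hypothetical estimated noisy posteriors (in practice these would come from a model fitted on the noisy sample, as in the heuristic of [30]): for each class i, the instance with the highest estimated P(Ȳ = i|X = x) serves as a pseudo anchor point, and its noisy posterior is read off as the i-th row of the estimated T.

```python
import numpy as np

# Hypothetical estimated noisy posteriors P_hat(Ybar | X = x), one row per instance.
noisy_posteriors = np.array([
    [0.80, 0.15, 0.05],   # plausibly an anchor point for class 1
    [0.30, 0.40, 0.30],
    [0.10, 0.75, 0.15],   # plausibly an anchor point for class 2
    [0.20, 0.10, 0.70],   # plausibly an anchor point for class 3
])

def estimate_T(noisy_post):
    """Eq. (1): per class i, pick the instance with the highest P_hat(Ybar = i | X)
    as a pseudo anchor point; its noisy posterior becomes row i of the estimate."""
    anchor_idx = noisy_post.argmax(axis=0)   # one pseudo-anchor per class
    return noisy_post[anchor_idx]            # T_hat[i, :] = P_hat(Ybar | X = anchor_i)

T_hat = estimate_T(noisy_posteriors)
```

If the selected instances are not true anchor points, T_hat is biased, which is exactly the failure mode Section 3 addresses.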
Given only noisy data, to ensure the learnability and efficiency of learning transition matrices (or mixture parameters), a series of assumptions were proposed, e.g., irreducibility [36], anchor points [22, 35], and separability [31]. All those assumptions require anchor points, or instances belonging to a specific class with probability one or approaching one.
When there are no anchor points in datasets/data distributions, all the above-mentioned methods will lead to inaccurate transition matrices, which will degenerate the performance of current consistent algorithms. This motivates us to investigate how to maintain the efficacy of those consistent algorithms without using exact anchor points.

(Figure 1 graphic, showing the matrices T and T̃, omitted.) Figure 1: Illustrative experimental results (using a 5-class classification problem as an example). The noisy class posterior probability P(Ȳ|X = x) can be estimated by exploiting noisy data. Let an example have P(Ȳ|X = x) = [0.141; 0.189; 0.239; 0.281; 0.15]. If the true transition matrix T is given, we can infer the clean class posterior probability as P(Y|X = x) = (T^⊤)^{-1} P(Ȳ|X = x) = [0.15; 0.28; 0.25; 0.3; 0.02], so the instance belongs to the fourth class. However, if the transition matrix is not accurately learned but estimated as T̃ (differing from T only slightly, in two entries of the second row), the clean class posterior probability is inferred as P(Y|X = x) = (T̃^⊤)^{-1} P(Ȳ|X = x) = [0.1587; 0.2697; 0.2796; 0.2593; 0.0325], and the instance could be mistakenly classified into the third class.

3 Label-Noise Learning without Anchor Points

This section presents a deep-learning-based risk-consistent estimator for the classification risk w.r.t. clean data. 
We employ this estimator to tune the transition matrix effectively without using anchor points, which finally leads to better classifiers.

¹ In the literature, the assumption inf_x P(Y = i|X = x) → 1 was introduced as irreducibility [5] to ensure that the transition matrix is identifiable; an anchor point x for class i is defined by P(Y = i|X = x) = 1 [35, 22] to ensure a fast convergence rate. In this paper, we generalize the definition to an anchor point family, including instances whose class posterior probability P(Y = i|X = x) is equal to or close to one.

3.1 Motivation
According to Eq. (1), to learn the transition matrix, P(Ȳ|X = x) needs to be estimated and anchor points need to be given. Note that learning P(Ȳ|X = x) may introduce error. Even worse, when there are no anchor points, it is problematic to use the existing methods [36, 22, 35, 31] to learn transition matrices. For example, let P(Y|X = x_i) be the i-th column of a matrix L, i = 1, . . . , C. If x_i is an anchor point for the i-th class, then L is an identity matrix. According to Eq. (1), if we use x_i as an anchor point for the i-th class while P(Y = i|X = x_i) ≠ 1 (e.g., the instances identified in [30] are not guaranteed to be anchor points), the learned transition matrix would be TL, where L is a non-identity matrix. This means that transition matrices will be inaccurately estimated.
Based on inaccurate transition matrices, the accuracy of current consistent algorithms will significantly degenerate. To demonstrate this, Figure 1 shows that given a noisy class posterior probability P(Ȳ|X = x), even if the transition matrix changes slightly in two entries, e.g., ‖T − T̃‖₁/‖T‖₁ = 0.02, where T and T̃ are defined in Figure 1 and ‖T‖₁ = Σ_ij |T_ij|, the inferred class posterior probability for the clean data may lead to an incorrect classification. 
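The failure mode illustrated in Figure 1 is easy to reproduce. Below is a 3-class toy sketch with hypothetical numbers (not the figure's 5-class example): perturbing two entries of one row of T by 0.05 each flips the inferred clean label.

```python
import numpy as np

# Hypothetical 3-class setting (the paper's Figure 1 uses 5 classes).
T_true = np.array([[0.6, 0.3, 0.1],
                   [0.3, 0.6, 0.1],
                   [0.1, 0.3, 0.6]])
p_clean = np.array([0.35, 0.40, 0.25])        # true clean posterior: class 2 wins
p_noisy = T_true.T @ p_clean                  # observable noisy posterior

# Inverting with the true T recovers the clean posterior, so class 2 still wins.
p_good = np.linalg.solve(T_true.T, p_noisy)

# Perturb two entries of the first row by 0.05 each (the row still sums to one).
T_wrong = T_true.copy()
T_wrong[0] = [0.55, 0.35, 0.10]
p_bad = np.linalg.solve(T_wrong.T, p_noisy)   # -> [0.42, 0.33, 0.25]: now class 1 wins
```

A small relative error in T is thus enough to change the predicted class, mirroring the figure's point.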
Since anchor points require clean class posterior probabilities to be equal to or approach one, which is quite strong for some real-world applications [46, 41], we would like to study how to maintain the performance of current consistent algorithms when there are no anchor points and transition matrices are therefore inaccurately learned.

3.2 Risk-consistent estimator
Intuitively, the entries of the transition matrix can be tuned by minimizing the risk-consistent estimator, since the estimator is asymptotically identical to the expected risk for the clean data and a favorable transition matrix should make the clean expected risk small. However, existing risk-consistent estimators involve the inverse of the transition matrix (more details are provided in Appendix B), which degenerates classification performance [30] and makes tuning the transition matrix ineffective. To address this, we propose a risk-consistent estimator that does not involve the inverse of the transition matrix.
The inverse of the transition matrix is involved in risk-consistent estimators because the noisy class posterior probability P(Ȳ|X = x) and the transition matrix are explicitly or implicitly used to infer the clean class posterior probability P(Y|X = x), i.e., P(Y|X = x) = (T^⊤)^{-1} P(Ȳ|X = x). To avoid the inverse in building risk-consistent estimators, we directly estimate P(Y|X = x) instead of inferring it through P(Ȳ|X = x). Thanks to the equation T^⊤ P(Y|X = x) = P(Ȳ|X = x), P(Y|X = x) and P(Ȳ|X = x) can be estimated at the same time by adding the true transition matrix to modify the output of the softmax function, e.g., [47, 30]. Specifically, P(Ȳ|X = x) can be learned by exploiting the noisy data, as shown in Figure 2, by minimizing the unweighted loss R̄_n(f) = (1/n) Σ_{i=1}^n ℓ(f(X_i), Ȳ_i), where ℓ(f(X), Ȳ) is a loss function [25]. Let T̂ + ΔT be the true transition matrix, i.e., T̂ + ΔT = T. 
Due to P(Ȳ|X = x) = T^⊤ P(Y|X = x), the output of the softmax function g(x) = P̂(Y|X = x) before the transition matrix is an approximation of P(Y|X = x). However, the g(x) = P̂(Y|X = x) learned by minimizing the unweighted loss may perform poorly if the true transition matrix is inaccurately learned, as explained in the motivation.
If we had P(Y|X = x) and P(Ȳ|X = x), we could employ the importance reweighting technique [11, 22] to rewrite the expected risk w.r.t. clean data without involving the inverse of the transition matrix. Specifically,

R(f) = E_{(X,Y)∼D}[ℓ(f(X), Y)]
     = ∫_x Σ_i P_D(X = x, Y = i) ℓ(f(x), i) dx
     = ∫_x Σ_i [P_D(X = x, Y = i)/P_D̄(X = x, Ȳ = i)] P_D̄(X = x, Ȳ = i) ℓ(f(x), i) dx
     = ∫_x Σ_i [P_D(Y = i|X = x)/P_D̄(Ȳ = i|X = x)] P_D̄(X = x, Ȳ = i) ℓ(f(x), i) dx
     = E_{(X,Ȳ)∼D̄}[ℓ̄(f(X), Ȳ)],    (2)

where D denotes the distribution for clean data, D̄ the distribution for noisy data, ℓ̄(f(x), i) = [P_D(Y = i|X = x)/P_D̄(Ȳ = i|X = x)] ℓ(f(x), i), and the second-to-last equation holds because label noise is assumed to be

(Figure 2 graphic, a diagram of the network in which a noise adaptation layer (T̂ + ΔT)^⊤ follows the softmax output g(X) = P̂(Y|X) to produce P̂(Ȳ|X), omitted.) Figure 2: An overview of the proposed method. 
The proposed method will learn a more accurate classifier because the transition matrix is renovated.

Algorithm 1 Reweight T-Revision (Reweight-R) Algorithm.
Input: Noisy training sample D_t; noisy validation set D_v.
Stage 1: Learn T̂
1: Minimize the unweighted loss to learn P̂(Ȳ|X = x) without a noise adaptation layer;
2: Initialize T̂ according to Eq. (1) by using the instances with the highest P̂(Ȳ = i|X = x) as anchor points for the i-th class;
Stage 2: Learn the classifier f and ΔT
3: Initialize the neural network by minimizing the weighted loss with a noise adaptation layer T̂^⊤;
4: Minimize the weighted loss to learn f and ΔT with a noise adaptation layer (T̂ + ΔT)^⊤;
// Stopping criterion for learning P̂(Ȳ|X = x), f, and ΔT: stop when P̂(Ȳ|X = x) yields the minimum classification error on the noisy validation set D_v
Output: T̂, ΔT, and f.

independent of instances. In the rest of the paper, we omit the subscript of P when no confusion is caused. Since P(Ȳ|X = x) = T^⊤ P(Y|X = x) and the diagonal entries of (learned) transition matrices for label-noise learning are all much larger than zero, P_D(Ȳ = i|X = x) ≠ 0 implies P_D̄(Ȳ = i|X = x) ≠ 0, which also makes the proposed importance reweighting method stable without truncating the importance ratios.
Eq. (2) shows that the expected risk w.r.t. clean data and the loss ℓ(f(x), i) is equivalent to an expected risk w.r.t. noisy data and a reweighted loss, i.e., [P_D(Y = i|X = x)/P_D̄(Ȳ = i|X = x)] ℓ(f(x), i). The empirical counterpart of the risk on the rightmost side of Eq. (2) is therefore a risk-consistent estimator for label-noise learning. We exploit a deep neural network to build this counterpart. As shown in Figure 2, we use the output of the softmax function g(x) to approximate P(Y|X = x), i.e., g(x) = P̂(Y|X = x) ≈ P(Y|X = x). 
Then, T^⊤ g(x) (or (T̂ + ΔT)^⊤ g(x) in the figure) is an approximation of P(Ȳ|X = x), i.e., T^⊤ g(x) = P̂(Ȳ|X = x) ≈ P(Ȳ|X = x). By employing P̂(Y = y|X = x)/P̂(Ȳ = y|X = x) as the weight, we build the risk-consistent estimator as

R̄_{n,w}(T, f) = (1/n) Σ_{i=1}^n [g_{Ȳ_i}(X_i)/(T^⊤ g)_{Ȳ_i}(X_i)] ℓ(f(X_i), Ȳ_i),    (3)

where f(X) = arg max_{j∈{1,...,C}} g_j(X), g_j(X) is an estimate of P(Y = j|X), and the subscript w denotes that the loss function is weighted. Note that if the true transition matrix T is given, R̄_{n,w}(T, f) has only one argument, g, to learn.

3.3 Implementation and the T-revision method

When the true transition matrix T is unavailable, we propose to use R̄_{n,w}(T̂ + ΔT, f) to approximate R(f), as shown in Figure 2. To minimize R̄_{n,w}(T̂ + ΔT, f), a two-stage training procedure is proposed. Stage 1: first learn P(Ȳ|X = x) by minimizing the unweighted loss without a noise adaptation layer, and initialize T̂ by exploiting the examples that have the highest learned P̂(Ȳ|X = x); Stage 2: modify the initialization T̂ by adding a slack variable ΔT, and learn the classifier and ΔT by minimizing the weighted loss. The procedure is called the Weighted T-Revision method and is summarized in Algorithm 1. It is worthwhile to mention that all anchor-point-based consistent estimators for label-noise learning have a similar two-stage training procedure, with one stage to learn P(Ȳ|X = x) and the transition matrix and a second stage to learn the classifier for the clean data.
The proposed T-Revision method works because we learn ΔT by minimizing the risk-consistent estimator, which is asymptotically equal to the expected risk w.r.t. clean data. 
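A sketch of the estimator in Eq. (3) in numpy (batch form; taking ℓ to be cross-entropy on the noise-adapted output is one concrete reading of the loss, and all tensors here are hypothetical):

```python
import numpy as np

def reweighted_loss(g, T, noisy_labels):
    """Eq. (3): weight each example by g_{ybar}(x) / (T^T g(x))_{ybar}.

    g            : (n, C) softmax outputs, an estimate of P(Y | X)
    T            : (C, C) current transition estimate, i.e. T_hat + Delta_T
    noisy_labels : (n,) observed noisy labels in {0, ..., C-1}
    """
    noisy_post = g @ T                           # row i holds T^T g(x_i), estimating P(Ybar | X = x_i)
    idx = np.arange(g.shape[0])
    w = g[idx, noisy_labels] / noisy_post[idx, noisy_labels]
    ce = -np.log(noisy_post[idx, noisy_labels])  # cross-entropy against the noisy labels
    return np.mean(w * ce)

# Sanity check with hypothetical tensors: with an identity transition matrix,
# all weights equal 1 and the estimator reduces to plain cross-entropy.
g = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = reweighted_loss(g, np.eye(3), labels)
```

With the true T and a well-fitted g, the weight approximates P(Y = ȳ|x)/P(Ȳ = ȳ|x), matching Eq. (2); in Algorithm 1 the same quantity would be minimized jointly over the network and ΔT.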
The learned slack variable can also be validated on the noisy validation set, i.e., by checking whether P̂(Ȳ|X = x) fits the validation set. The philosophy of our approach is similar to that of the cross-validation method. However, the proposed method does not need to try different combinations of parameters (ΔT is learned) and is thus much more computationally efficient. Note that the proposed method will also boost the performance of consistent algorithms even when there are anchor points, since the transition matrices and classifiers are jointly learned. Note also that if a clean validation set is available, it can be used to better initialize the transition matrix, to better validate the slack variable ΔT, and to fine-tune the deep network.

3.4 Generalization error
While we have discussed the use of the proposed estimator for evaluating the risk w.r.t. clean data, we now theoretically justify how it generalizes for learning classifiers. Assume the neural network has d layers, parameter matrices W_1, . . . , W_d, and activation functions σ_1, . . . , σ_{d−1} for each layer. Denote the mapping of the neural network by h : x ↦ W_d σ_{d−1}(W_{d−1} σ_{d−2}(· · · σ_1(W_1 x))) ∈ R^C. Then, the output of the softmax is defined by g_i(x) = exp(h_i(x))/Σ_{k=1}^C exp(h_k(x)), i = 1, . . . , C. Let f̂ = arg max_{i∈{1,...,C}} ĝ_i be the classifier learned from the hypothesis space F determined by the real-valued parameters of the neural network, i.e., f̂ = arg min_{f∈F} R̄_{n,w}(f).
To derive a generalization bound, following common practice [6, 25], we assume that instances are upper bounded by B, i.e., ‖x‖ ≤ B for all x ∈ X, and that the loss function is L-Lipschitz continuous w.r.t. f(x) and upper bounded by M, i.e., for any f_1, f_2 ∈ F and any (x, ȳ), |ℓ(f_1(x), ȳ) − ℓ(f_2(x), ȳ)| ≤ L|f_1(x) − f_2(x)|, and for any (x, ȳ), ℓ(f(x), ȳ) ≤ M.
Theorem 1  Assume the Frobenius norms of the weight matrices W_1, . . .
, W_d are at most M_1, . . . , M_d. Let the activation functions be 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU). Let the loss function be the cross-entropy loss, i.e., ℓ(f(x), ȳ) = −Σ_{i=1}^C 1{ȳ = i} log(g_i(x)). Let f̂ and ΔT̂ be the learned classifier and slack variable. Assume ΔT̂ is searched from a space of ΔT constituting valid transition matrices², i.e., ∀ΔT and ∀i ≠ j, T̂_ij + ΔT_ij ≥ 0 and T̂_ii + ΔT_ii > T̂_ij + ΔT_ij. Then, for any δ > 0, with probability at least 1 − δ,

E[R̄_{n,w}(T̂ + ΔT̂, f̂)] − R̄_{n,w}(T̂ + ΔT̂, f̂) ≤ 2BCL(√(2d log 2) + 1) Π_{i=1}^d M_i/√n + CM √(log(1/δ)/(2n)).

A detailed proof is provided in Appendix C. The factor (√(2d log 2) + 1) Π_{i=1}^d M_i is induced by the hypothesis complexity of the deep neural network [10] (see Theorem 1 therein), which could be improved [27, 48, 16]. Although the proposed reweighted loss is more complex than the traditional unweighted loss function, we have derived a generalization error bound no larger than those derived for algorithms employing the traditional loss [25] (as can be seen from Lemma 2 in the proof of the theorem). This shows that the proposed Algorithm 1 does not need a larger training sample to achieve a small difference between the training error R̄_{n,w}(T̂ + ΔT̂, f̂) and the test error E[R̄_{n,w}(T̂ + ΔT̂, f̂)]. Also note that deep learning is powerful in yielding a small training error. If the training sample size n is large, then the upper bound in Theorem 1 is small, which implies a small E[R̄_{n,w}(T̂ + ΔT̂, f̂)] and justifies why the proposed method has small test errors in the experiment section. 
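For intuition about the bound's scale, one can plug illustrative constants into Theorem 1 (the numbers below are entirely hypothetical, not taken from the paper's experiments); the first term decays as O(1/√n):

```python
import math

# Entirely hypothetical constants for the Theorem 1 bound.
B, C, L, M = 1.0, 10, 1.0, 5.0        # instance-norm bound, #classes, Lipschitz constant, loss bound
d = 4                                 # network depth
widths = [2.0, 2.0, 2.0, 2.0]         # Frobenius-norm bounds M_1, ..., M_d
delta, n = 0.05, 50_000               # confidence level and noisy sample size

complexity = 2 * B * C * L * (math.sqrt(2 * d * math.log(2)) + 1) * math.prod(widths) / math.sqrt(n)
confidence = C * M * math.sqrt(math.log(1 / delta) / (2 * n))
bound = complexity + confidence       # quadrupling n halves the complexity term
```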
Meanwhile, in the experiment section, we show that the proposed method is much superior to the state-of-the-art methods in classification accuracy, implying that the small generalization error is not obtained at the cost of enlarging the approximation error.

² During the training, T̂ + ΔT can be ensured to be a valid transition matrix by first projecting its negative entries to zero and then performing row normalization. In the experiments, ΔT is initialized to be a zero matrix, and we have not pushed T̂ + ΔT to be a valid matrix when tuning ΔT.

4 Experiments

Table 1: Means and standard deviations (percentage) of classification accuracy. Methods with "-A" mean that they run on the intact datasets without removing possible anchor points; methods with "-R" mean that the transition matrix used is revised by a revision ΔT̂.

                 MNIST                      CIFAR-10                   CIFAR-100
                 Sym-20%       Sym-50%      Sym-20%       Sym-50%      Sym-20%       Sym-50%
Decoupling-A     95.39±0.29    81.52±0.29   79.85±0.30    52.22±0.45   42.75±0.49    29.24±0.54
MentorNet-A      96.57±0.18    90.13±0.09   80.49±0.52    70.71±0.24   52.11±0.10    38.45±0.25
Co-teaching-A    97.22±0.18    91.68±0.21   82.38±0.11    72.80±0.45   54.23±0.08    41.37±0.08
Forward-A        98.75±0.08    97.86±0.22   85.63±0.52    77.92±0.66   57.75±0.37    44.66±1.01
Reweight-A       98.71±0.11    98.13±0.19   86.77±0.40    80.16±0.46   58.35±0.64    43.97±0.67
Forward-A-R      98.84±0.09    98.12±0.22   88.10±0.21    81.11±0.74   62.13±2.09    50.46±0.52
Reweight-A-R     98.91±0.04    98.38±0.21   89.63±0.13    83.40±0.65   65.40±1.07    50.24±1.45

Table 2: Means and standard deviations (percentage) of classification accuracy. Methods with "-N/A" mean that instances with high estimated P(Y|X) are removed from the dataset; methods with "-R" mean that the transition matrix used is revised by a revision ΔT̂.

                   MNIST                      CIFAR-10                   CIFAR-100
                   Sym-20%       Sym-50%      Sym-20%       Sym-50%      Sym-20%       Sym-50%
Decoupling-N/A     95.93±0.21    82.55±0.39   75.37±1.24    47.19±0.19   39.59±0.42    24.04±1.19
MentorNet-N/A      97.11±0.09    91.44±0.25   78.51±0.31    67.37±0.30   48.62±0.43    33.53±0.31
Co-teaching-N/A    97.69±0.23    93.58±0.49   81.72±0.14    70.44±1.01   53.21±0.54    40.06±0.83
Forward-N/A        98.64±0.12    97.74±0.13   84.75±0.81    74.32±0.69   56.23±0.34    39.28±0.59
Reweight-N/A       98.69±0.08    98.05±0.22   85.53±0.26    77.70±1.00   56.60±0.71    39.28±0.71
Forward-N/A-R      98.80±0.06    97.96±0.13   86.93±0.39    77.14±0.65   58.72±0.45    44.60±0.79
Reweight-N/A-R     98.85±0.02    98.37±0.17   88.90±0.22    81.55±0.94   62.00±1.78    44.75±2.10

Datasets  We verify the effectiveness of the proposed method on three synthetic noisy datasets, i.e., MNIST [19], CIFAR-10 [18], and CIFAR-100 [18], and one real-world noisy dataset, i.e., clothing1M
[44]. MNIST has 10 classes of images, including 60,000 training images and 10,000 test images. CIFAR-10 has 10 classes of images, including 50,000 training images and 10,000 test images. CIFAR-100 also has 50,000 training images and 10,000 test images, but 100 classes. For all the datasets, we leave out 10% of the training examples as a validation set. The three datasets contain clean data. We corrupted the training and validation sets manually according to true transition matrices T. Specifically, we employ the symmetry flipping setting defined in Appendix D. Sym-50 generates heavy label noise and leads almost half of the instances to have noisy labels, while Sym-20 generates light label noise and leads around 20% of the instances to have noisy labels. Note that the pair flipping setting [14], where each row of the transition matrix has only two non-zero entries, has also been widely studied. However, for simplicity, we do not pose any constraint on the slack variable ΔT to achieve specific speculation about the transition matrix, e.g., sparsity [13]. We leave this for future work. Besides reporting the classification accuracy on the test set, we also report the discrepancy between the learned transition matrix T̂ + ΔT̂ and the true one T. All experiments are repeated five times on those three datasets. Clothing1M consists of 1M images with real-world noisy labels, and additionally 50k, 14k, and 10k images with clean labels for training, validation, and testing. We use the 50k clean data to help initialize the transition matrix, as did the baseline [30].
Network structure and optimization  For a fair comparison, we implement all methods with default parameters in PyTorch on an NVIDIA Tesla V100. We use a LeNet-5 network for MNIST, a ResNet-18 network for CIFAR-10, and a ResNet-34 network for CIFAR-100. For learning the transition matrix T̂ in the first stage, we follow the optimization method in [30]. 
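The symmetry flipping corruption can be sketched as follows (a common construction: keep the label with probability 1 − ε and flip uniformly to the other C − 1 classes; the paper's exact definition is in Appendix D, so treat this sketch as an assumption):

```python
import numpy as np

def symmetric_T(C, eps):
    """Symmetric flipping: keep the label w.p. 1 - eps, else flip uniformly."""
    T = np.full((C, C), eps / (C - 1))
    np.fill_diagonal(T, 1.0 - eps)
    return T

def corrupt_labels(labels, T, rng):
    """Draw a noisy label for each clean label from the corresponding row of T."""
    C = T.shape[0]
    return np.array([rng.choice(C, p=T[y]) for y in labels])

rng = np.random.default_rng(0)
T = symmetric_T(10, 0.5)                           # Sym-50% on a 10-class problem
noisy = corrupt_labels(np.zeros(10_000, dtype=int), T, rng)
# About half of the labels stay at class 0; the rest spread over classes 1..9.
```

Sampling the noisy labels row-wise from T is exactly how the clean training and validation sets are corrupted before training.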
During the second stage, we first use SGD with momentum 0.9, weight decay 10⁻⁴, batch size 128, and an initial learning rate of 10⁻² to initialize the network. The learning rate is divided by 10 after the 40th and 80th epochs; 200 epochs are run in total. Then, the optimizer and learning rate are changed to Adam and 5×10⁻⁷ to learn the classifier and the slack variable. For CIFAR-10 and CIFAR-100, we perform data augmentation by horizontal random flips and 32×32 random crops after padding 4 pixels on each side. For Clothing1M, we use a ResNet-50 pre-trained on ImageNet. Following [30], we also exploit the 1M noisy data and the 50k clean data to initialize the transition matrix. In the second stage, for initialization, we use SGD with momentum 0.9, weight decay 10⁻³, batch size 32, and run with learning rates 10⁻³ and 10⁻⁴ for 5 epochs each. For learning the classifier and the slack variable, Adam is used and the learning rate is changed to 5×10⁻⁷.

Table 3: Means and standard deviations (percentage) of classification accuracy on MNIST with different label noise levels.
Methods with "-A" mean that they run on the intact datasets without removing possible anchor points; methods with "-R" mean that the transition matrix used is revised by a revision ΔT̂; methods with "-N/A" mean that instances with high estimated P(Y|X) are removed from the dataset.

                    Sym-60%      Sym-70%      Sym-80%
Forward-A           97.10±0.08   96.06±0.41   91.46±1.03
Forward-A-R         97.65±0.11   96.42±0.35   91.77±0.22
Reweight-A          97.39±0.27   96.25±0.26   93.79±0.52
Reweight-A-R        97.83±0.18   97.13±0.08   94.19±0.45
Forward-N/A         96.82±0.14   94.61±0.28   85.95±1.01
Forward-N/A-R       96.99±0.16   95.02±0.17   86.04±1.03
Reweight-N/A        97.01±0.20   95.94±0.14   91.59±0.70
Reweight-N/A-R      97.81±0.12   96.59±0.15   91.91±0.65

Table 4: Classification accuracy (percentage) on Clothing1M.

Decoupling   MentorNet   Co-teaching   Forward   Reweight   Forward-R   Reweight-R
53.98        56.77       58.68         71.79     70.95      72.25       74.18

Baselines We compare the proposed method with state-of-the-art approaches. Specifically, we compare with the following three inconsistent but well-designed algorithms: Decoupling [24], MentorNet [15], and Co-teaching [14], which do not rely on learning transition matrices. To compare with consistent estimators, we set Forward [30], a classifier-consistent algorithm, and the importance reweighting method (Reweight), a risk-consistent algorithm, as baselines.
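As a rough sketch of how the Reweight baseline forms its risk-consistent objective: in the importance-reweighting view [22], an example with noisy label ỹ is weighted by P(Y = ỹ | x) / P(Ỹ = ỹ | x), where the noisy posterior is the clean posterior passed through the transition matrix. The function and variable names below are ours, for illustration only:

```python
import numpy as np

def reweighted_loss(clean_posterior: np.ndarray, noisy_label: int,
                    T: np.ndarray) -> float:
    """Importance-reweighted cross-entropy for one example.

    clean_posterior: model estimate of P(Y | x), shape (num_classes,).
    T[i, j] = P(noisy = j | clean = i), so P(noisy | x) = T^T @ P(clean | x).
    """
    noisy_posterior = T.T @ clean_posterior
    weight = clean_posterior[noisy_label] / noisy_posterior[noisy_label]
    return float(-weight * np.log(clean_posterior[noisy_label]))

# With an identity transition matrix (no noise), the weight is 1 and the
# objective reduces to the usual cross-entropy.
p = np.array([0.7, 0.2, 0.1])
print(np.isclose(reweighted_loss(p, 0, np.eye(3)), -np.log(0.7)))  # True
```

Note that, unlike the Backward correction mentioned below, this weighting never inverts T, which is one reason it behaves more stably when T is imperfectly estimated.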
The risk-consistent estimator involving the inverse of the transition matrix, e.g., Backward in [30], is not included in the comparison, because it has been reported to perform worse than the Forward method [30].

4.1 Comparison for classification accuracy
The importance of anchor points To show the importance of anchor points, we modify the datasets by removing possible anchor points, i.e., instances with large estimated class posterior probability P(Y|X), before corrupting the training and validation sets. As the MNIST dataset is simple, we removed 40% of the instances with the largest estimated class posterior probabilities in each class. For CIFAR-10 and CIFAR-100, we removed 20% of the instances with the largest estimated class posterior probabilities in each class. To make them easy to distinguish, we mark "-A" in an algorithm's name if it runs on the original intact datasets, and "-N/A" if it runs on the modified datasets.
Comparing Decoupling-A, MentorNet-A, and Co-teaching-A in Table 1 with Decoupling-N/A, MentorNet-N/A, and Co-teaching-N/A in Table 2, we find that on MNIST the methods with "-N/A" work better, while on CIFAR-10 and CIFAR-100 the methods with "-A" work better. This is because those methods are independent of transition matrices but dependent on dataset properties; removing possible anchor points does not always lead to performance degeneration.
Comparing Forward-A and Reweight-A with Forward-N/A and Reweight-N/A, we find that the methods without anchor points, i.e., with "-N/A", degenerate clearly. The degeneration on MNIST is slight because the dataset can be well separated, and many instances have high class posterior probabilities even in the modified dataset. These results show that, without anchor points, the consistent algorithms suffer performance degeneration.
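The anchor-point estimation that these consistent algorithms rely on (following the recipe in [30]) can be sketched as below; names are ours. For each class i, the instance predicted most confidently as class i is treated as an anchor, and its noisy posterior vector is read off as row i of T. When true anchors exist, the rows of T are recovered exactly, which is why removing high-posterior instances hurts:

```python
import numpy as np

def estimate_T(noisy_posteriors: np.ndarray) -> np.ndarray:
    """Estimate the transition matrix from estimated noisy class posteriors
    P(noisy Y | X), shape (num_examples, num_classes): the instance with the
    highest posterior for class i is treated as an anchor for class i."""
    num_classes = noisy_posteriors.shape[1]
    T = np.empty((num_classes, num_classes))
    for i in range(num_classes):
        anchor = noisy_posteriors[:, i].argmax()  # most anchor-like instance
        T[i] = noisy_posteriors[anchor]
    return T

# Toy check: with perfect anchors, whose noisy posteriors equal rows of the
# true T, the estimator recovers T exactly.
true_T = np.full((3, 3), 0.15); np.fill_diagonal(true_T, 0.7)
anchors = true_T.copy()                      # one perfect anchor per class
mixed = np.array([[0.4, 0.35, 0.25]])        # a non-anchor instance
T_hat = estimate_T(np.vstack([anchors, mixed]))
print(np.allclose(T_hat, true_T))            # True
```

If the anchor rows are removed, `argmax` falls back on mixed instances such as the last row, and the estimated rows of T become systematically biased, matching the degeneration observed for the "-N/A" methods.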
Specifically, on CIFAR-100, the methods with "-N/A" have much worse performance than the ones with "-A", with accuracy dropping by at least 4%.
To discuss the model performances on MNIST with more label noise, we raise the noise rates to 60%, 70%, and 80%. Other experimental settings are unchanged. The results are presented in Table 3. We can see that the proposed model outperforms the baselines more significantly as the noise rate grows.
Risk-consistent estimator vs. classifier-consistent estimator Comparing Forward-A with Reweight-A in Table 1 and comparing Forward-N/A with Reweight-N/A in Table 2, it can be seen that the proposed Reweight method, a risk-consistent estimator not involving the inverse of the transition matrix, works slightly better than or is comparable to Forward, a classifier-consistent algorithm.

Figure 3: The estimation error of the transition matrix by employing classifier-consistent and risk-consistent estimators, on (a) MNIST, (b) CIFAR-10, and (c) CIFAR-100. The first row is about Sym-20 label noise while the second row is about Sym-50 label noise. The error bar for standard deviation in each figure has been shaded.

Note that in [30], it is reported that Backward, a risk-consistent estimator which involves the inverse of the transition matrix, works worse than Forward, the classifier-consistent algorithm.
The importance of T-revision Note that, for a fair comparison, we also set a baseline that modifies the transition matrix in Forward. As shown in Tables 1 and 2, methods with "-R" use the proposed T-Revision method, i.e., they modify the learned T̂ by adding ΔT̂. Comparing the results in Tables 1 and 2, we find that the T-Revision method significantly outperforms the others. Among them, the proposed Reweight-R works significantly better than the baseline Forward-R.
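The revision step and the estimation error reported for it can be sketched as follows. This is a hand-set illustration of the metric ||T − T̂ − ΔT̂||₁ / ||T||₁; in T-Revision the slack variable ΔT̂ is learned jointly with the classifier rather than fixed as here, and the names are ours:

```python
import numpy as np

def estimation_error(T: np.ndarray, T_hat: np.ndarray,
                     delta_T: np.ndarray) -> float:
    """Error of the revised matrix, ||T - (T_hat + delta_T)||_1 / ||T||_1,
    with || . ||_1 taken entrywise (sum of absolute values)."""
    return float(np.abs(T - (T_hat + delta_T)).sum() / np.abs(T).sum())

T = np.array([[0.8, 0.2], [0.2, 0.8]])        # true transition matrix
T_hat = np.array([[0.7, 0.3], [0.3, 0.7]])    # poor initial estimate (no anchors)
delta = np.array([[0.1, -0.1], [-0.1, 0.1]])  # slack variable found by revision
print(round(estimation_error(T, T_hat, np.zeros_like(T)), 6))  # 0.2 before revision
print(round(estimation_error(T, T_hat, delta), 6))             # 0.0 after a perfect revision
```

A smaller value of this error indicates that the revised matrix T̂ + ΔT̂ is closer to the true T, which is the quantity plotted for Forward-R and Reweight-R.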
We can find that the T-Revision method boosts the classification performance even without removing possible anchor points. The rationale behind this may be that the network, the transition matrix, and the classifier are jointly learned and validated, and that the identified anchor points are not reliable.
Comparison on real-world dataset The proposed T-Revision method significantly outperforms the baselines, as shown in Table 4.

4.2 Comparison for estimating transition matrices

To show that the proposed risk-consistent estimator is more effective in modifying the transition matrix, we plot the estimation error for the transition matrix, i.e., ||T − T̂ − ΔT̂||₁ / ||T||₁. In Figure 3, we can see that in all cases, the proposed risk-consistent-estimator-based revision leads to smaller estimation errors than the classifier-consistent-algorithm-based method (Forward-R), showing that the risk-consistent estimator is more powerful in modifying the transition matrix. This also explains why the proposed method works better. We provide more discussion about Figure 3 in Appendix E.

5 Conclusion

This paper presents a risk-consistent estimator for label-noise learning that does not involve the inverse of the transition matrix, together with a simple but effective learning paradigm called T-Revision, which trains deep neural networks robustly under noisy supervision. The aim is to maintain the effectiveness and efficiency of current consistent algorithms when there are no anchor points and the transition matrices are therefore poorly learned. The key idea is to revise the learned transition matrix and validate the revision by exploiting a noisy validation set. We conduct experiments on both synthetic and real-world label-noise data to demonstrate that the proposed T-Revision can significantly boost the performance of label-noise learning. In the future, we will extend the work in the following aspects.
First, we will study how to incorporate prior knowledge of the transition matrix, e.g., sparsity, into the end-to-end learning system. Second, we will study how to recursively learn the transition matrix and the classifier, as our experiments show that transition matrices can be refined.

Acknowledgments

TLL was supported by Australian Research Council Projects DP180103424 and DE190101473. NNW was supported by National Natural Science Foundation of China under Grants 61922066 and 61876142, and the CCF-Tencent Open Fund. CG was supported by NSF of China under Grants 61602246 and 61973162, NSF of Jiangsu Province under Grant BK20171430, the Fundamental Research Funds for the Central Universities under Grant 30918011319, and the "Young Elite Scientists Sponsorship Program" by CAST under Grant 2018QNRC001. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study. XBX and TLL would like to give special thanks to Haifeng Liu and Brain-Inspired Technology Co., Ltd. for their support of GPUs used for this research.

References

[1] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

[2] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In NeurIPS, pages 6240–6249, 2017.

[3] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[4] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[5] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection.
Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.

[6] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[8] Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with bounded instance- and label-dependent label noise. arXiv preprint arXiv:1709.03768, 2017.

[9] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.

[10] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In COLT, pages 297–299, 2018.

[11] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, pages 131–160, 2009.

[12] Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R Scott, and Dinglong Huang. CurriculumNet: Weakly supervised learning from large-scale web images. In ECCV, pages 135–150, 2018.

[13] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In NeurIPS, pages 5836–5846, 2018.

[14] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.

[15] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels.
In ICML, pages 2309–2318, 2018.

[16] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

[17] Jan Kremer, Fei Sha, and Christian Igel. Robust active label correction. In AISTATS, pages 308–316, 2018.

[18] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[19] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[20] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[21] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In ICCV, pages 1910–1918, 2017.

[22] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2016.

[23] Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah M Erfani, Shu-Tao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In ICML, pages 3361–3370, 2018.

[24] Eran Malach and Shai Shalev-Shwartz. Decoupling "when to update" from "how to update". In NeurIPS, pages 960–970, 2017.

[25] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

[26] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, pages 1196–1204, 2013.

[27] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In NeurIPS, pages 5947–5956, 2017.

[28] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro.
A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR, 2018.

[29] Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. In UAI, 2017.

[30] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pages 1944–1952, 2017.

[31] Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In ICML, pages 2052–2060, 2016.

[32] Scott E Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR, 2015.

[33] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, pages 4331–4340, 2018.

[34] Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.

[35] Clayton Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, pages 838–846, 2015.

[36] Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, pages 489–511, 2013.

[37] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, pages 5552–5560, 2018.

[38] Kiran K Thekumparampil, Ashish Khetan, Zinan Lin, and Sewoong Oh. Robustness of conditional GANs to noisy labels. In NeurIPS, pages 10271–10282, 2018.

[39] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks.
In NeurIPS, pages 5596–5605, 2017.

[40] Robert A Vandermeulen and Clayton D Scott. An operator theoretic approach to nonparametric mixture models. arXiv preprint arXiv:1607.00071, 2016.

[41] Robert A Vandermeulen and Clayton D Scott. An operator theoretic approach to nonparametric mixture models. Accepted to The Annals of Statistics, 2019.

[42] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

[43] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, pages 839–847, 2017.

[44] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.

[45] Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor W Tsang, and Masashi Sugiyama. How does disagreement benefit co-teaching? In ICML, 2019.

[46] Xiyu Yu, Tongliang Liu, Mingming Gong, Kayhan Batmanghelich, and Dacheng Tao. An efficient and provable approach for mixture proportion estimation using linear independence assumption. In CVPR, pages 4480–4489, 2018.

[47] Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. Learning with biased complementary labels. In ECCV, pages 68–83, 2018.

[48] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[49] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels.
In NeurIPS, pages 8778–8788, 2018.