{"title": "Learning from Bad Data via Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 6044, "page_last": 6055, "abstract": "Bad training data prevent a learning model from understanding the underlying data-generating scheme, which in turn makes it harder to achieve satisfactory performance on unseen test data. We suppose the real data distribution lies in a distribution set supported by the empirical distribution of the bad data. A worst-case formulation can be developed over this distribution set and then interpreted as a generation task in an adversarial manner. The connections and differences between GANs and our framework are thoroughly discussed. We further theoretically show the influence of this generation task on learning from bad data and reveal its connection with a data-dependent regularization. Given different distance measures (e.g., Wasserstein distance or JS divergence) of distributions, we can derive different objective functions for the problem. 
Experimental results on different kinds of bad training data demonstrate the necessity and effectiveness of the proposed method.", "full_text": "Learning from Bad Data via Generation\n\nTianyu Guo1,2,*, Chang Xu2,*, Boxin Shi3,4,*, Chao Xu1, Dacheng Tao2\n1Key Laboratory of Machine Perception (MOE), CMIC, School of EECS,\nPeking University, 100871, China\n2UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering,\nThe University of Sydney, Darlington, NSW 2008, Australia\n3National Engineering Laboratory for Video Technology,\nDepartment of Computer Science and Technology, Peking University, Beijing, 100871, China\n4Peng Cheng Laboratory, Shenzhen, 518040, China\n{tianyuguo, shiboxin}@pku.edu.cn, chaoxu@cis.pku.edu.cn\n{c.xu, dacheng.tao}@sydney.edu.au\n\nAbstract\n\nBad training data prevent a learning model from understanding the underlying data-generating scheme, which in turn makes it harder to achieve satisfactory performance on unseen test data. We suppose the real data distribution lies in a distribution set supported by the empirical distribution of the bad data. A worst-case formulation can be developed over this distribution set and then interpreted as a generation task in an adversarial manner. The connections and differences between GANs and our framework are thoroughly discussed. We further theoretically show the influence of this generation task on learning from bad data and reveal its connection with a data-dependent regularization. Given different distance measures (e.g., Wasserstein distance or JS divergence) of distributions, we can derive different objective functions for the problem. 
Experimental results on different kinds of bad training data demonstrate the necessity and effectiveness of the proposed method.\n\n1 Introduction\n\nMachine learning techniques are applied to fit the data distribution induced by the training set and then make predictions for new examples in various applications, such as image classification [18, 37, 39, 20], image generation [22, 40, 35, 14], and semantic segmentation [12, 36, 31, 17]. An important assumption underlying the success of these methods is that the training set and the test set are subject to the same distribution. It is therefore expected that models well trained on the training set can achieve similar performance on test data that have never been seen during training.\nThe true underlying distribution of the data is unknown, and many methods can be applied to approximate it. For example, the cross-entropy loss is often taken as the objective function of deep neural networks in classification tasks, which is equivalent to a maximum likelihood estimation of the unknown data distribution based on the training data [13]. However, many factors, such as the size of the training set [13], the way data is collected [11], and the balance between different categories in the training set [30], will affect the results of maximum likelihood estimation. If the data distribution approximated by the well-trained model on the training sample is far from the true data distribution, performance on the test set will hardly be comparable with that on the training set.\nIn real-world applications, there usually exist \"bad\" data that are instantiated from imbalanced, noisy, or reduced training sets, resulting in settings where the observed training samples do not well represent the true underlying distribution of the data.\n\n*Corresponding authors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n
In this paper, we study the problem of learning from bad data and propose an adversarial learning strategy based on our theoretical analysis. Instead of optimizing the risk under the uniform empirical distribution over the observed bad data, as classical methods do, we turn to an expected loss against a family of distributions that could contain the true data-generating scheme with high confidence. Specifically, a deep neural network is introduced to approximate the latent data distribution characterized by the observed bad data and the properties of the true data distribution. Taking the Wasserstein distance as the measure between distributions, we establish a three-player game to solve a worst-case problem. We provide theoretical analysis to show that the optimal generator captures the observed empirical distribution and fits a worse distribution for the classifier. The proposed method roughly corresponds to a data-dependent gradient regularization over the empirical distribution, and we provide a performance guarantee for optimization on the learned distribution. Experiments on multiple natural image datasets confirm that the proposed method provides a robust approach to complement bad training data in different scenarios.\n\n2 Proposed Method\n\nConsider a training set X = {(x_1, y_1), ..., (x_N, y_N)} containing N examples, which are independently sampled from an unknown data distribution. Ideally, the empirical distribution \hat{P}_N is a good estimation of the true distribution P_N, which means that parameters learned on the empirical distribution will eventually converge to the values learned on the true distribution. However, in practice there is a certain distance between the empirical distribution \hat{P}_N and the real distribution P_N. This results in unsatisfactory performance on the test samples for the model learned from the empirical distribution. 
The discrepancy between the empirical distribution and the real distribution can have many causes, e.g., samples are polluted by noise, or samples of some categories are hard to obtain, which reduces the number of samples in those categories. To restore the performance of models on bad data, we need to rethink the conventional empirical risk minimization over the training data.\nSuppose that the real data distribution is in an ambiguity set supported by the empirical distribution \hat{P}_N. We thus propose to optimize an upper bound of the loss function over all probability distributions in this ambiguity set,\n\n\inf_{\theta \in \Theta} \sup_{Q \in B_\epsilon(\hat{P}_N)} E_Q[\ell_\theta(x, y)], (1)\n\nwhere the distribution set B_\epsilon(\cdot) contains all the distributions Q whose distance from the empirical distribution \hat{P}_N does not exceed \epsilon. The distribution set B_\epsilon(\cdot) is defined as follows:\n\nB_\epsilon(\hat{P}_N) \triangleq \{ Q \in M(Z) : d(Q, \hat{P}_N) \le \epsilon \}, (2)\n\nwhere d(\cdot, \cdot) stands for some pre-defined distance metric, M(Z) denotes the set of probability measures supported on Z, and Z is the set of possible values of (x, y). According to this definition, we investigate all possible distributions within a ball centred at \hat{P}_N with radius \epsilon. We aim to discover the distribution Q in B_\epsilon(\hat{P}_N) that corresponds to the worst case, so that optimization over this worst-case distribution implies an optimization over the entire distribution set B_\epsilon(\hat{P}_N), which is also assumed to include the real data distribution P_N.\nEq. (1) is intractable, as the worst-case distribution Q is unknown. As a result, in the following we focus on the inner part of Eq. (1) to find the worst-case distribution Q. Firstly, we re-express the inner part of our objective function defined in Eq. (1) as follows:\n\n\sup_{Q \in B_\epsilon(\hat{P}_N)} E_Q[\ell_\theta(x, y)] = \sup_{Q} \int_Z \ell_\theta(x, y)\, Q(d(x, y)) \quad \text{s.t. } d(Q, \hat{P}_N) \le \epsilon, (3)\n\nwhere \Pi refers to a joint distribution of (x, y) and (x', y') with marginals Q and \hat{P}_N respectively. With the help of a standard duality argument, we have\n\n\sup_{Q \in B_\epsilon(\hat{P}_N)} E_Q[\ell_\theta(x, y)] = \sup_{Q \in M(Z)} \inf_{\lambda \ge 0} \Big\{ \int_Z \ell_\theta(x, y)\, Q(d(x, y)) + \lambda \big( \epsilon - d(Q, \hat{P}_N) \big) \Big\}\n\le \inf_{\lambda \ge 0} \sup_{Q \in M(Z)} \Big\{ \lambda \epsilon + \int_Z \ell_\theta(x, y)\, Q(d(x, y)) - \lambda \cdot d(Q, \hat{P}_N) \Big\}\n= \inf_{\lambda \ge 0} \Big\{ \lambda \epsilon + \sup_{Q} \Big[ \int_Z \ell_\theta(x, y)\, Q(d(x, y)) - \lambda \cdot d(Q, \hat{P}_N) \Big] \Big\}, (4)\n\nwhere \lambda is a Lagrangian multiplier. The first term in Eq. (4) is independent of the distribution Q and the loss function, so the constraint that the distribution Q should satisfy is the following:\n\n\sup_{Q} \big\{ E_Q[\ell_\theta(x, y)] - \lambda \cdot d(Q, \hat{P}_N) \big\}. (5)\n\nIt can be seen that the distribution Q should make the loss function as large as possible while trying to reduce its distance from the empirical distribution \hat{P}_N. It is instructive to note that the distance metric d(\cdot, \cdot) plays an important role in determining the solution of the distribution Q.\n\n2.1 Learning via Generation\nWe introduce the Wasserstein distance d_W(\cdot, \cdot) to describe the distance between the distributions \hat{P}_N and Q. 
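The penalized inner problem in Eqs. (4)-(5) can be illustrated with a minimal numeric sketch; the scalar loss, quadratic ground cost, grid search, and the names `loss`, `cost`, `lam` below are our illustrative choices, not the paper's networks:

```python
# Lagrangian-relaxed worst case of Eq. (4)/(5) on a toy 1-D problem:
# for each empirical point x_i, search a grid for the point that maximizes
#   loss(x) - lam * cost(x, x_i),
# i.e. a large loss achieved at a small transport cost from the sample.

def loss(theta, x):
    """Toy surrogate loss l_theta(x)."""
    return (theta - x) ** 2

def cost(x, x0):
    """Ground cost s(x, x0) for moving mass from x0 to x."""
    return (x - x0) ** 2

def robust_loss(theta, samples, lam, grid):
    """Average of the per-sample penalized worst case (inner sup of Eq. (4))."""
    return sum(
        max(loss(theta, x) - lam * cost(x, x0) for x in grid) for x0 in samples
    ) / len(samples)

def empirical_loss(theta, samples):
    return sum(loss(theta, x) for x in samples) / len(samples)

samples = [-1.0, 0.0, 0.5, 2.0]
grid = [i / 100.0 for i in range(-500, 501)]  # candidate worst-case points

emp = empirical_loss(0.0, samples)
rob = robust_loss(0.0, samples, lam=1.0, grid=grid)
hard = robust_loss(0.0, samples, lam=1e6, grid=grid)
# rob upper-bounds emp, and a very large lam (i.e. a tiny radius epsilon)
# collapses the robust loss back to the empirical one.
```

A larger `lam` plays exactly the role discussed after Eq. (10): it shrinks the effective radius epsilon, pulling the worst-case distribution back towards the empirical one.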
Wasserstein distance measures the distance between two distributions as the minimum variation required to transition from one distribution to the other. We define the Wasserstein distance as follows:\n\nd_W(Q_1, Q_2) \triangleq \min_{\Pi \in M(\Pi)} \int_{Z \times Z} s((x_1, y_1), (x_2, y_2))\, \Pi(d(x_1, y_1), d(x_2, y_2)), (6)\n\nwhere \Pi is defined as the joint distribution of (x_1, y_1) and (x_2, y_2) with marginals Q_1 and Q_2 respectively, M(\Pi) represents the space of all such joint probability measures \Pi, and s is a metric measuring the cost of moving (x_1, y_1) to (x_2, y_2). The calculation of the Wasserstein distance is not straightforward because of the need to find an optimal joint distribution \Pi that minimizes the integral. According to [29], the Wasserstein distance can be further calculated as follows:\n\nd_W(Q_1, Q_2) = \sup_{f \in F} \Big\{ \int f(x, y)\, Q_1(d(x, y)) - \int f(x, y)\, Q_2(d(x, y)) \Big\}, (7)\n\nwhere F denotes the space of all Lipschitz functions with |f(t) - f(t')| \le \|t - t'\| for all t, t' \in T. Eq. (7) replaces the optimal joint distribution \Pi involved in Eq. (6) by finding a specific function f in the function set F. Substituting Eq. (7) as the distance metric into Eq. (5), we obtain\n\n\sup_{Q} \Big\{ \int_X \ell_\theta(x, y)\, Q(d(x, y)) - \lambda \cdot \sup_{f \in F} \Big[ \frac{1}{N} \sum_{i=1}^N f(x_i, y_i) - \int_X f(x, y)\, Q(d(x, y)) \Big] \Big\}\n= \frac{1}{N} \sum_{i=1}^N \sup_{Q_i} \Big\{ \int_X \ell_\theta(x, y)\, Q_i(d(x, y)) - \lambda \cdot \Big[ \hat{f}(x_i, y_i) - \int_X \hat{f}(x, y)\, Q_i(d(x, y)) \Big] \Big\}\n= \frac{1}{N} \sum_{i=1}^N \sup_{(x, y) \sim Q} \big\{ \ell_\theta(x, y) - \lambda \cdot [\hat{f}(x_i, y_i) - \hat{f}(x, y)] \big\}, (8)\n\nwhere \hat{f} = \mathrm{argmax}_{f \in F}\, \frac{1}{N} \sum_{i=1}^N f(x_i, y_i) - \int_X f(x, y)\, Q(d(x, y)) is the optimized function describing the Wasserstein distance between the desirable distribution Q and the empirical distribution \hat{P}_N, and Q_i is the conditional distribution of (x, y) given (x_i, y_i). The joint distribution \Pi of (x_i, y_i) and (x, y) with marginals \hat{P}_N and Q respectively (see Eq. (3)) can be written as \Pi = \frac{1}{N} \sum_{i=1}^N \delta_{(x_i, y_i)} \otimes Q_i. According to the law of total probability, we can factorize Q as in the first line of Eq. (8). Eq. (8) bridges the training sample and the distribution Q, and Q is thus defined by\n\nQ = \mathrm{argmax}_{Q}\, \frac{1}{N} \sum_{i=1}^N \big\{ \ell_\theta(x, y) - \lambda \cdot [\hat{f}(x_i, y_i) - \hat{f}(x, y)] \big\}\n= \mathrm{argmax}_{Q}\, E_Q[\ell_\theta(x, y) + \lambda \cdot \hat{f}(x, y)] - \lambda \cdot E_{\hat{P}_N}[\hat{f}(x, y)]\n= \mathrm{argmax}_{Q}\, E_Q[\ell_\theta(x, y) + \lambda \cdot \hat{f}(x, y)]. (9)\n\nA neural network G(z) can be employed to approximate the distribution Q, and thus Eq. (9) can be rewritten as the maximization of E_Z[\ell_\theta(G(z)) + \lambda \cdot \hat{f}(G(z))]. 
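Kantorovich duality (Eq. (6) versus Eq. (7)) can be sanity-checked in one dimension, where the Wasserstein-1 distance between two equal-size empirical samples has the closed form mean(|sorted(a) - sorted(b)|); the sample values and the test function f(t) = -t are our illustrative choices:

```python
# Primal (Eq. (6)) vs. dual (Eq. (7)) view of the Wasserstein-1 distance in 1-D.

def w1_primal(a, b):
    """Optimal-transport cost between two equal-size 1-D empirical samples."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def dual_value(f, a, b):
    """E_a[f] - E_b[f], the dual objective of Eq. (7) for a test function f."""
    return sum(f(x) for x in a) / len(a) - sum(f(y) for y in b) / len(b)

a = [0.0, 1.0, 2.0, 4.0]
b = [1.0, 1.5, 3.0, 5.0]

primal = w1_primal(a, b)                 # 0.875
lower = dual_value(lambda t: -t, a, b)   # f(t) = -t is 1-Lipschitz
# Duality: every 1-Lipschitz f under-estimates W1; here f(t) = -t attains it,
# because every point of b lies to the right of its matched point of a.
```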
According to Eq. (8), we need to solve for an optimal \hat{f} to calculate the Wasserstein distance. We adopt a neural network D for this purpose and propose to maximize \frac{1}{N} \sum_{i=1}^N D(x_i, y_i) - \frac{1}{N} \sum_{i=1}^N \int_X D(x, y)\, Q_i(d(x, y)). Finally, we can learn the classifier over the bad data by considering the worst-case distribution through the following objective function:\n\n\min_{G} \max_{D,C} U(C, G, D) = \lambda \big( E_{\hat{P}_N}[D(x, y)] - E_Q[D(x, y)] \big) - E_Q[\ell(C(x), y)]. (10)\n\nRecall that \lambda is a Lagrangian multiplier in Eq. (4). According to the analysis in the Lagrange multiplier method, a larger \lambda corresponds to a smaller \epsilon, which implies that the distribution Q is closer to the distribution \hat{P}_N. Conversely, a smaller \lambda allows a larger distribution distance \epsilon, which lets the distribution Q be explored over a sufficiently large range. In the following, we provide an intuitive explanation of why the worst-case optimization works. Minimizing the loss of the worst-case distribution Q implies an optimization over all distributions within the ball of an appropriate radius \epsilon (see Eq. (1)), which could also include the unknown real distribution P_N. Though the worst-case Q may not be exactly the real P_N, the classifier (i.e., \theta) must have fitted P_N better than (or at least as well as) Q, as the classification error over Q is the worst. Over iterations, the worst-case Q is dynamically determined by the classifier, and the classifier fits the real P_N increasingly better in an implicit way.\nDifference from GANs. Though we also introduce a generator and investigate an adversarial game, our model has several differences from existing GAN models. Compared with WGAN [3], besides the critic network D, our generative network G further plays against the classification network C. 
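The alternating structure of the three-player game in Eq. (10) can be sketched with scalar stand-ins; the linear critic with weight clipping, the squared-error classifier loss, the single learned shift `delta` playing the generator, and all hyper-parameter values are our simplifications, not the paper's architecture:

```python
import random

random.seed(0)
# Two 1-D classes centred at -1 and +1; l(x, y) = (theta * x - t)^2 with
# t = 2y - 1, critic d(x) = w * x kept 1-Lipschitz by clipping |w| <= 1,
# and a "generator" that learns one shift delta applied to the real samples.
data = [(random.gauss(2 * y - 1, 0.5), y) for y in (0, 1) for _ in range(50)]
theta, w, delta, lam, lr = 0.0, 0.0, 0.0, 1.0, 0.01

def loss_grads(theta, x, t):
    """Gradients of (theta * x - t)^2 w.r.t. theta and w.r.t. x."""
    r = theta * x - t
    return 2 * r * x, 2 * r * theta

for _ in range(300):
    g_theta = g_w = g_delta = 0.0
    for x, y in data:
        t = 2 * y - 1
        xq = x + delta                 # sample from the worst-case Q
        d_theta, d_x = loss_grads(theta, xq, t)
        g_theta += d_theta             # C: minimise classification loss on Q
        g_w += x - xq                  # D: maximise E_P[d] - E_Q[d]
        g_delta += lam * w + d_x       # G: raise critic score and C's loss
    n = len(data)
    theta -= lr * g_theta / n
    w = max(-1.0, min(1.0, w + lr * lam * g_w / n))   # weight clipping, as in WGAN
    delta += lr * g_delta / n
```

In the paper C, D, and G are deep networks trained with Adam; the loop above only mirrors who ascends and who descends in U(C, G, D).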
There are some three-player GAN models, such as Triple GAN [27], Triangle GAN [10] (actually four players), and ALI [9]. These models have two opposite generation models, C : x -> y and G : y -> x. First, we have a different motivation for establishing the adversarial games compared with these existing methods. Triple GAN and Triangle GAN are dedicated to semi-supervised learning, while ALI is dedicated to improving the training of the D network by learning an opposite mapping from y to x. In contrast, we aim to learn a classifier that can deal with bad data, and our adversarial model provides an appropriate measure between the worst-case distribution and the empirical data distribution. In addition, the two sets of opposite mappings in existing methods share the same goal, i.e., to deceive the D network, and there is no explicit relationship between the two generators (i.e., G and C). However, our generator deceives not only the discriminator but also the classifier. That is to say, our two opposite mappings are directly competitive with each other.\n\n3 Theoretical Analysis\n\nIn the proposed method, the classifier is optimized on a learned distribution Q. Q represents the worst-case distribution within a certain range, which is a key point of the entire algorithm. In this section, we provide a formal technical analysis of the convergence of the three networks to better understand the relationships between them. Next, by analyzing the difference between the empirical distribution \hat{P}_N and the learned distribution Q, we prove that our algorithm can be regarded as a data-dependent gradient regularization, which gives a reason for the improvement in generalization ability provided by the proposed algorithm. See the supplementary material for proofs.\nInfluence of optimal D and G. 
In the framework defined by Eq. (10), the critic network D attempts to fit the desirable function \hat{f} = \mathrm{argmax}_{f \in F}\, \frac{1}{N} \sum_{i=1}^N f(x_i, y_i) - \frac{1}{N} \sum_{i=1}^N \int_X f(x, y)\, Q_i(d(x, y)), which represents the Wasserstein distance between the two distributions. As a result, the optimal critic network is expected to describe the Wasserstein distance perfectly, i.e., E_{\hat{P}_N}[D^*(x, y)] - E_Q[D^*(x, y)] = d_W(\hat{P}_N, Q).\nNext we analyze the target distribution of the generator G. As described in Eq. (10), the generator aims to maximize the loss \lambda E_Q[D(x, y)] + E_Q[\ell(C(x), y)]. In Theorem 1 we summarize the equilibrium distribution obtained by G, which is also determined by \lambda.\nTheorem 1. With the optimal critic network D and the classifier C fixed, the optimization of the generator G is equivalent to minimizing \lambda \cdot d_W(\hat{P}_N, Q) - D_{KL}(Q \| P_c).\nTheorem 1 suggests that the distribution Q will be optimized to be as far as possible from the distribution P_c while moving towards the distribution \hat{P}_N. The distribution Q thus iteratively fits a worse distribution for C and enforces C to be optimized over the whole distribution set B_\epsilon.\nData-dependent regularization. Traditional methods optimize the classification network C over the empirical distribution \hat{P}_N. From the perspective of network complexity, regularizers are often introduced to improve the generalization capability of the network, such as weight decay and dropout. 
We theoretically show that by setting the distance metric as the Wasserstein distance, we derive a data-dependent gradient regularization.\nTheorem 2. Consider x as the input samples of the classifier C, and let the distribution Q \in B_\epsilon(\hat{P}_N) lie in a Wasserstein ball centered at \hat{P}_N with radius \epsilon. Then for any \epsilon \ge 0 and \alpha \ge 1 + \beta, we have\n\n\epsilon \|\nabla_z \ell(z)\|_{\hat{P}_N, \alpha^*} - \epsilon^{\beta+1} \|h(z)\|_{\hat{P}_N, \frac{\alpha}{\alpha-\beta-1}} \le E_Q(\ell(z)) - E_{\hat{P}_N}(\ell(z)) \le \epsilon \|\nabla_z \ell(z)\|_{\hat{P}_N, \alpha^*} + \epsilon^{\beta+1} \|h(z)\|_{\hat{P}_N, \frac{\alpha}{\alpha-\beta-1}}, (11)\n\nwhere \|f(z)\|_{\hat{P}_N, \alpha} \triangleq (\frac{1}{N} \sum_{i=1}^N \|f(z_i)\|^\alpha)^{1/\alpha}, \alpha^* = \frac{\alpha}{\alpha-1}, and h(z) is a function and \beta \in (0, 1] a constant which satisfy \|\nabla_z \ell(z_1) - \nabla_z \ell(z_2)\| \le h(z_2) \cdot \|z_1 - z_2\|^\beta for any z = (x, y) \in Z.\nThis shows that optimization over the worst-case distribution Q can be roughly interpreted as a data-driven gradient regularization. By minimizing the loss function \ell(z) over Q, the gradient of the loss function with respect to the empirical samples, \nabla_x \ell(C(x_i, \theta), y_i), is also optimized. Furthermore, gradient penalties applied over the empirical samples lead the classifier C to react more gently to changes in the input, which provides another perspective on the effectiveness of our algorithm.\nPerformance guarantees. In this part, we analyze the generalization capability of the classifiers obtained by the proposed method. 
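The message of Theorem 2 can be checked on a one-dimensional instance; the loss l(z) = z^2 (so grad l(z) = 2z, h = 2, beta = 1) and the sample values are our choices:

```python
# Shift each sample a distance eps along the sign of its loss gradient and
# compare the loss increase with the upper bound of Eq. (11):
#   gap <= eps * ||grad l|| + eps**(beta + 1) * h.
eps = 0.1
zs = [-2.0, -0.5, 1.0, 3.0]

def l(z):
    return z * z

def grad(z):
    return 2 * z

emp = sum(l(z) for z in zs) / len(zs)
worst = sum(l(z + eps * (1 if grad(z) >= 0 else -1)) for z in zs) / len(zs)
grad_norm = sum(abs(grad(z)) for z in zs) / len(zs)  # the alpha* = 1 norm

gap = worst - emp
# Here gap = eps * grad_norm + eps**2 exactly, inside the two-sided bound.
```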
The generalization ability of the classifier is often described as the gap between the performance of the network on training samples x \in Q and on a new sample \tilde{x}; a smaller gap corresponds to better generalization. In the following theorem, we propose a bound on the predictive performance of the classifier on new samples.\nTheorem 3. For any 0 < \delta < 1, with probability at least 1 - \delta with respect to the sampling,\n\nE(\ell(C(x, \theta), y)) \le E_Q(\ell(C(x, \theta), y)) + \frac{12R}{\sqrt{n}} \Big( \log \frac{3\sqrt{n}}{R} + 1 \Big) + \sqrt{\frac{8 \log(2/\delta)}{N}}, (12)\n\nand for any \zeta > \frac{12R}{\sqrt{n}} ( \log \frac{3\sqrt{n}}{R} + 1 ) + \sqrt{\frac{8 \log(2/\delta)}{N}}, we have\n\nP(\ell(C(x, \theta), y) \ge E_Q(\ell(C(x, \theta), y)) + \zeta) \le \frac{E_Q(\ell(C(x, \theta), y)) + \frac{12R}{\sqrt{n}} ( \log \frac{3\sqrt{n}}{R} + 1 ) + \sqrt{\frac{8 \log(2/\delta)}{N}}}{E_Q(\ell(C(x, \theta), y)) + \zeta}, (13)\n\nwhere R is only related to the architecture of the neural network.\nWe leave the detailed definition of R to the appendix. Theorem 3 provides a bound on the error of the classifier on new samples, which contains two probability measures. Eq. (12) indicates that, with probability at least 1 - \delta, the error of the classifier on new samples does not exceed its error on Q by more than \frac{12R}{\sqrt{n}} ( \log \frac{3\sqrt{n}}{R} + 1 ) plus a sampling term, where N is the number of training samples. Eq. (13) provides an upper bound on the probability that the classifier's error on a new sample exceeds this margin. From Eq. (12), we find that the bound on the classifier error is related to the number of training samples N: the larger the number of samples, the smaller the error. In our training framework, the generator G is responsible for producing training samples, and G is also updated along with the classifier C. 
Consequently, the number of training samples far exceeds that of traditional algorithms, which reduces the error of the classifier on new data and improves its generalization ability.\n\n4 Extension to Other Distance Measures\n\nThe proposed model can be extended to a standard GAN based game by investigating the Jensen-Shannon (JS) divergence between distributions. The critic network D is designed to fit the desirable function \hat{f}(\cdot) and calculate the Wasserstein distance between the distribution Q and the empirical distribution \hat{P}_N. In a standard GAN, the discriminator acts as a classifier and attempts to distinguish fake samples generated by G from real samples. The objective function of the discriminator can be written as E_{\hat{P}_N}[\log(D(x, y))] + E_Q[\log(1 - D(x, y))]. By simply replacing the objective function of the network D, we can formulate the three-player game based on a standard GAN as follows:\n\n\min_{C,D} \max_{G} U(C, G, D) = \lambda \big( E_{\hat{P}_N}[\log(D(x, y))] + E_Q[\log(1 - D(x, y))] \big) + E_Q[\ell(C(x, \theta), y)]. (14)\n\nAs described in Eq. (14), the generator G is responsible for fitting the objective distribution Q and tries to fool both the discriminator D and the classifier C. The confrontation with the discriminator D leads the generator G to produce samples that are as close as possible to the true distribution. At the same time, these samples also make the performance of the classifier C worse.\nNow we consider the standard GAN based framework and analyze the optimal discriminator and generator. 
The discriminator D is optimized by E_{\hat{P}_N}[\log(D(x, y))] + E_Q[\log(1 - D(x, y))], which can be considered as distinguishing generated samples (x, y) \sim Q from true samples (x, y) \sim \hat{P}_N. Following the analysis in GAN [14], the optimal discriminator D will balance between the true distribution \hat{P}_N and the learned distribution Q.\nTheorem 4. For the generator G and classifier C fixed, the optimal discriminator D is\n\nD^*_{G,C}(x, y) = \frac{p_{data}(x, y)}{p_{data}(x, y) + p_g(x, y)}, (15)\n\nwhere p_g(x, y) is the distribution generated by G.\nWith the optimal discriminator D fixed, we can reformulate the objective function by replacing D(x, y) in Eq. (14) according to Theorem 4. By doing so, we show that the optimal generator G will also balance between the empirical distribution \hat{P}_N and the distribution P_c represented by the classifier C, as summarized in the following theorem.\nTheorem 5. With the optimal discriminator D and the classifier C fixed, the optimization of the generator G is equivalent to minimizing -\log 4 + 2\,\mathrm{JSD}(\hat{P}_N \| Q) - \frac{1}{\lambda} \cdot D_{KL}(Q \| P_c).\nJustification of the standard GAN based framework. The distribution Q obtained by Eq. (10) is a straightforward result of Eq. (3) and satisfies two conditions: first, the distance between the distributions \hat{P}_N and Q is less than a constant \epsilon; second, the classification loss is made as bad as possible. As Eq. (14) is obtained by simply replacing the critic loss with the discriminator loss E_{\hat{P}_N}[\log(D(x, y))] + E_Q[\log(1 - D(x, y))], whether the distribution Q obtained in Eq. (14) satisfies these conditions of Eq. (3) cannot be easily justified. The distribution Q obtained by Eq. (14) is optimized according to the minimization of -\log 4 + 2\,\mathrm{JSD}(\hat{P}_N \| Q) - \frac{1}{\lambda} \cdot D_{KL}(Q \| P_c). 
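Theorem 4 and the -log 4 + 2 JSD value appearing in Theorem 5 can be verified on a discrete example; the two three-point distributions below are our stand-ins for the empirical distribution and Q:

```python
import math

# With D*(x) = p(x) / (p(x) + q(x)) (Eq. (15)), the discriminator value
# E_p[log D*] + E_q[log(1 - D*)] equals -log 4 + 2 * JSD(p || q).
p = {0: 0.5, 1: 0.3, 2: 0.2}   # stand-in for the empirical distribution
q = {0: 0.2, 1: 0.3, 2: 0.5}   # stand-in for the generated distribution Q

def kl(a, b):
    return sum(a[x] * math.log(a[x] / b[x]) for x in a)

m = {x: 0.5 * (p[x] + q[x]) for x in p}         # mixture midpoint
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

d_star = {x: p[x] / (p[x] + q[x]) for x in p}    # optimal discriminator
value = sum(p[x] * math.log(d_star[x]) for x in p) \
      + sum(q[x] * math.log(1 - d_star[x]) for x in q)
# value == -log 4 + 2 * jsd, up to floating-point error
```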
By comparing Theorem 5 and Theorem 1, ignoring the constant term -\log 4, we find that the major difference between the equilibrium distributions of Eq. (10) and Eq. (14) is the choice of distance metric, i.e., the Wasserstein distance d_W(\hat{P}_N, Q) for Eq. (10) and the JS divergence \mathrm{JSD}(\hat{P}_N \| Q) for Eq. (14). We thus build a relationship between the initial objective and Eq. (14) and conclude that the loss function defined by Eq. (14) can be viewed as the JS divergence version of Eq. (3). The JS divergence based objective function shares the same training procedure with the Wasserstein distance one. Our proposed algorithm is summarized in Algorithm 1 in the Appendix.\n\n5 Experiments\n\nIn this section, we evaluate our method in three kinds of bad-data environments: (i) long-tailed training set classification on the MNIST [25], FMNIST [42], and CIFAR-10 [23] datasets; (ii) classification with distorted test sets on the CIFAR-10 and SVHN [33] datasets; and (iii) a reduced training set generation task on the FMNIST and CIFAR-10 datasets. We resize images in the MNIST and FMNIST datasets to 32×32 for convenience. Moreover, we use a conditional version of WGAN-GP [15] on all datasets except CIFAR-10, on which we use the 32-resolution version of BigGAN [4] instead. The classifier implemented on MNIST and FMNIST has a comparable architecture to Triple GAN [27], and we use VGG-16 [38] and ResNet-101 [18] on the CIFAR-10 and SVHN datasets. We implement our experiments in PyTorch. For the generator and discriminator we use a learning rate of 0.0002, and 0.02 for the classifier; learning rate decay is deployed, and the optimizer is Adam. 
Experiments are conducted on 4 NVIDIA 1080Ti GPUs.\n\n5.1 Classi\ufb01cation results\n\nFor the long tail experiment, we transform the original balanced training set according to an expo-\nnential function n = ni \u00d7 \u00b5i, where constant \u00b5 \u2208 (0, 1), and ni is the original number of sample\nof category i. Under this setting, we follow de\ufb01nition in [5] and de\ufb01ne the imbalance factor as\nthe number of training samples in the largest class divided by the smallest one. We compare the\nproposed method with two the state-of-the-art algorithms which are Class-Balanced [5] and DOS [1]\nrespectively. In experiments on noisy test sets, we introduce a certain intensity of Gaussian noise\nor salt-and-pepper noise into 70% of the test samples to produce noisy test sets. Moreover, in\nexperiment of noisy test sets, GAN-based methods use ResNet-101 as classi\ufb01er and MixQualNet [6]\n\n6\n\n\fTable 1: Accuracy (%) on long-tailed datasets with various imbalance factors.\nCIFAR\n\nFMNIST\n\nMNIST\n\nMethod\n\nImbalanced\nClassi\ufb01er\nDAGAN [2]\n\nTriple GAN [27]\n\n\u2206-GAN [10]\n\nOurs\n\nC-B Loss [5]\n\nDOS [1]\n\n100\n\n20\n\n1\n\n100\n\n20\n\n1\n\n20\n\n10\n\n1\n\n89.77\n89.92\n90.07\n90.25\n92.67\n\n92.90\n90.82\n\n94.62\n95.71\n95.19\n95.60\n96.23\n\n96.74\n97.20\n\n99.35\n99.23\n99.31\n99.28\n99.42\n\n99.38\n99.07\n\n79.89\n77.63\n78.92\n78.85\n83.06\n\n83.77\n82.74\n\n87.33\n86.30\n87.83\n87.62\n89.03\n\n89.97\n89.34\n\n93.46\n92.87\n92.55\n93.06\n93.24\n\n93.43\n93.18\n\n81.51\n70.34\n80.40\n79.99\n81.94\n\n84.36\n81.72\n\n85.57\n77.82\n83.06\n85.47\n86.86\n\n87.49\n86.55\n\n93.04\n91.66\n92.81\n92.87\n93.01\n\n93.64\n92.83\n\nTable 2: Accuracy (%) on the clean train sets and distorted test sets of CIFAR-10 and SVHN.\n\nMethod\n\nVGG16 [38]\nResNet [18]\nDAGAN [2]\n\nTriple GAN [27]\n\n\u2206-GAN [10]\nOurs-VGG\nOurs-ResNet\n\nMixQualNet [6]\nDCTNet [34]\n\nCIFAR-10\n\nSVHN\n\nG(0.2) G(0.3)\n\nS(0.02) Normal G(0.5) G(0.7)\n\nS(0.1) 
Normal\n\n82.98\n84.41\n80.67\n84.57\n84.46\n85.02\n85.87\n\n86.56\n85.42\n\n63.09\n64.15\n61.19\n63.76\n64.28\n64.43\n65.43\n\n65.70\n65.68\n\n63.87\n64.53\n61.38\n64.51\n64.59\n65.26\n66.80\n\n66.71\n69.13\n\n91.86\n93.04\n91.66\n92.81\n92.87\n92.32\n93.01\n\n89.62\n90.93\n\n94.84\n95.25\n94.64\n95.21\n95.42\n95.08\n95.58\n\n95.48\n95.55\n\n94.09\n94.52\n94.45\n94.60\n94.37\n94.55\n95.02\n\n95.27\n95.10\n\n93.70\n94.23\n93.91\n94.09\n94.14\n94.04\n94.67\n\n94.21\n94.88\n\n97.63\n98.17\n97.82\n97.85\n97.76\n97.69\n98.33\n\n96.80\n97.41\n\nand DCTNet [34] get their results based on VGG16. We also investigated the performance of our\napproach when using VGG and ResNet-101 as classi\ufb01er respectively and reported results in Table 2.\nTable 1 and Table 2 report results obtained on long-tailed and noisy datasets respectively. Note that\nimbalance factor (IF) of 1 means that the class is balance. In Table 2, G and S represent Gaussian\nnoise and salt-and-pepper noise respectively, and the number represents standard deviation in G\nand noise rate in P . It is obviously that the performance of the classi\ufb01er drops signi\ufb01cantly as the\nimbalance factor increases. We implement DAGAN to achieve data augmentation and train classi\ufb01er\nwith these data. However, the improvement is slight on the MNIST dataset, and the performance\neven drops on more complicated datasets such as FMNIST and CIFAR-10. It indicates that samples\ngenerated by GAN help less for classi\ufb01er. Triple GAN and Triangle GAN (\u2206-GAN) show more\nimprovement than that of DAGAN but are not stable enough. This phenomenon can be interpreted\nby their architectures, where the generator pleases the classi\ufb01er rather than playing against it. The\nproposed method outperforms the other GAN-based methods and achieves the best results on most\nconditions. We conclude the reason in following two points. First, generators in existing methods\ntend to \ufb01t the empirical distribution. 
Given a bad training set, their generated data could be even worse. Second, these generators often produce "easy" samples by cooperating with the classifier, and a nearly duplicate copy of the given bad training data could be sufficient for them yet useless for estimating the real data distribution. In contrast, our generator plays against the classifier, so the capability of the classifier can be largely enhanced over a distribution ball. Moreover, most GAN-based methods fail to improve performance on the normal datasets (IF of 1 or 'Normal'), but our method even outperforms the plain classifier on the MNIST and SVHN datasets. A possible reason is that even when training data are clean, a subtle gap probably still exists between the distributions of training and test data; in addition, the generator can conduct 'data augmentation' for the classifier. We also compare our method with state-of-the-art algorithms. Although each of these algorithms is designed for one specific type of data defect, the proposed method achieves comparable results in every situation.
Comparison with data augmentation methods. Considering the generator trained by the proposed algorithm as a learned data augmenter, the proposed method can be viewed as a data augmentation method. In this part we evaluate two common data augmentation methods against the proposed method on the CIFAR-10 dataset. The first method is a combination of regular data augmentation techniques (random cropping, horizontal flipping, and rotation); its results are shown in the first line of Table 3.

Table 3: Accuracy (%) on CIFAR-10.

Method       IF = 10  Reduced  G(0.2)
Combination    85.63    83.59   85.25
Mixup [44]     86.04    83.91   85.66
Ours           86.86    84.60   85.87

Figure 1: Generated images obtained by GAN and our method on imbalanced and reduced datasets.
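The "Combination" baseline above chains the three listed transforms. The sketch below is a plain NumPy stand-in for the usual library implementations; the pad size, rotation range, and flip probability are illustrative choices, not values from the paper.

```python
import numpy as np

def rotate_nn(img, deg):
    # Nearest-neighbor rotation of a 2-D image about its center.
    h, w = img.shape
    t = np.deg2rad(deg)
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    src_y = np.round(cy + (ys - cy) * np.cos(t) - (xs - cx) * np.sin(t)).astype(int)
    src_x = np.round(cx + (ys - cy) * np.sin(t) + (xs - cx) * np.cos(t)).astype(int)
    out = np.zeros_like(img)
    ok = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out[ys[ok], xs[ok]] = img[src_y[ok], src_x[ok]]
    return out

def augment(img, rng, pad=4, max_deg=15.0):
    # Random crop (after reflect-padding), random horizontal flip,
    # then a random small rotation.
    h, w = img.shape
    p = np.pad(img, pad, mode="reflect")
    y, x = rng.integers(0, 2 * pad + 1), rng.integers(0, 2 * pad + 1)
    img = p[y:y + h, x:x + w]
    if rng.random() < 0.5:
        img = img[:, ::-1]
    return rotate_nn(img, rng.uniform(-max_deg, max_deg))
```

Applying `augment` independently to every training image each epoch reproduces the standard pipeline this baseline refers to.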
The second data augmentation method is Mixup [44]. Table 3 shows that our method outperforms both comparison methods. Mixup also provides a reasonable improvement, but it is not as effective here as it is on regular datasets.

5.2 Analysis

Generation results. We compare the quality of images generated by our generator with GAN and Triple GAN on the MNIST and CIFAR-10 datasets with imbalanced or reduced training sets. The imbalance factor used here is 10, and we obtain the reduced training set by randomly selecting 30% of the samples from the training set. We compare the quality of images generated by our method with a standard GAN in Figure 1, and FID scores are reported in Table 4. We obtain the features used for calculating FID from a specific layer of the pre-trained Inception model. FIDs are calculated with 10,000 samples randomly chosen from the training dataset and 10,000 generated samples. On imbalanced training sets, the standard GAN fails to generate high-quality images for classes with fewer images, especially on the CIFAR-10 dataset, while our method generates images of satisfactory quality. On the MNIST dataset, the proposed method obtains a higher (worse) FID than Triple GAN, but our classifier achieves higher accuracy, as shown in Table 1. This indicates that the generator in our method does not produce images of the best quality, but does produce images that are more helpful to the classifier. On the reduced training set, our method outperforms the other algorithms.
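The FID computation described above fits a Gaussian to the feature statistics of real and of generated samples and compares the two fits. A self-contained sketch in pure NumPy (in the paper the features come from a pre-trained Inception layer; here they are arbitrary feature matrices):

```python
import numpy as np

def _sqrtm_psd(a):
    # Square root of a symmetric positive semi-definite matrix
    # via eigendecomposition.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(feats_real, feats_fake):
    # FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2}),
    # using Tr((C_r C_f)^{1/2}) = Tr((C_r^{1/2} C_f C_r^{1/2})^{1/2})
    # so that only symmetric PSD square roots are needed.
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    s_r = _sqrtm_psd(c_r)
    tr_mean = np.trace(_sqrtm_psd(s_r @ c_f @ s_r))
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(c_r) + np.trace(c_f) - 2.0 * tr_mean)
```

Identical feature sets give an FID of approximately zero, and a pure mean shift of the generated features adds exactly the squared norm of that shift.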
As the true images used to calculate FID scores are sampled from the whole training set instead of the reduced one, the FID measures how close the generated distribution is to the true distribution rather than to the empirical one. Minimizing the worst-case expected loss implies an optimization over all distributions in the ε-ball, in which the real data distribution is also expected to be included.

Table 4: FIDs with different distributions.

              MNIST             CIFAR-10
Method    IF=10  Reduced     IF=10  Reduced
GAN       33.49    31.06      None    37.96
Triple    27.24    26.30     26.22    22.57
Ours      27.60    26.02     26.10    22.63

Table 5: Ablation study results.

Method     CNN   DAGAN    Ours
BN+WD    83.19   83.08   84.60
No BN    75.14   71.92   78.01
No WD    77.01   75.38   78.71
Neither  71.85   71.63   75.42

Ablation study. Weight decay (WD) [24] and Batch Normalization (BN) [21] are common methods for increasing the robustness of a network. To illustrate the effectiveness of our method, we conduct experiments on four classifier configurations: (i) classifier with BN and WD, (ii) classifier with BN (without WD), (iii) classifier with WD (without BN), and (iv) classifier only (with neither BN nor WD). Results in Table 5 are obtained on a reduced CIFAR-10 dataset containing 20% of the original training samples, with ResNet-101 as the classifier. Table 5 shows that the proposed method not only outperforms the plain classifier and DAGAN but also suffers a smaller accuracy drop when the network structure changes. This suggests that our method provides more challenging images for the classifier and plays a role similar to these generalization techniques.
Hyper-parameter analysis. As shown in Eq. (2), a large ε leads to a set Bε of huge capacity, which could be flooded with distributions that are far away from both the empirical distribution P̂N and the real data distribution.
It is therefore reasonable to set ε within an appropriate range, as we usually do with hyper-parameters in machine learning. For a better understanding of the roles of ε and λ, introduced in Eq. (2) and Eq. (9) respectively, we use the long-tailed CIFAR-10 dataset with an imbalance factor of 10 and show the accuracy of the proposed method in Figure 2 (d). The search for the hyper-parameter is over ε = 1/λ with λ ∈ {0.01, 0.1, 0.3, 0.5, 1.0, 2.0}. We observe that a smaller value of ε makes the accuracy of the classifier closer to that of the general classifier, a too large value of ε drops the accuracy of the classifier dramatically, and the best ε is 0.3 on this dataset. These results can be explained by the theoretical analysis: according to Theorems 1 and 5, a large λ makes the distribution Q close to the empirical distribution P̂N, which makes the performance of the classifier similar to that of the general classifier, while a small λ leads the distribution Q to merely cheat the classifier while ignoring the quality of the generated images.

Figure 2: (a-c) Feature visualization for different samples. (d) Hyper-parameter analysis.

Visualization. In Figure 2 we visualize the second-last layer features of images sampled from the empirical distribution (x ∼ P̂N), generated by GAN, and generated by the proposed method (x ∼ Q) on the reduced MNIST dataset (30% of the training set). We obtain features by forwarding images through a classifier pre-trained on the reduced training set; features belonging to the same category are represented in the same color in each figure.
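The feature-extraction step behind this visualization (forward images through a pre-trained classifier and keep the second-last layer's activations) can be sketched as below. The tiny two-layer network with random weights is a hypothetical stand-in for the pre-trained classifier; a projection such as PCA or t-SNE would then map the features to 2-D for plotting.

```python
import numpy as np

class TinyClassifier:
    # Hypothetical stand-in for a pre-trained classifier:
    # input -> hidden (penultimate) -> logits.
    def __init__(self, d_in, d_hidden, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (d_in, d_hidden))
        self.w2 = rng.normal(0.0, 0.1, (d_hidden, n_classes))

    def penultimate(self, x):
        # Second-last layer activations: the features being visualized.
        return np.maximum(x @ self.w1, 0.0)

    def logits(self, x):
        return self.penultimate(x) @ self.w2

clf = TinyClassifier(d_in=784, d_hidden=64, n_classes=10)
batch = np.random.default_rng(1).normal(size=(32, 784))
feats = clf.penultimate(batch)  # shape (32, 64), one feature row per image
```

Collecting `feats` for samples drawn from P̂N, from a GAN, and from Q, then coloring each point by its class label, yields plots of the kind shown in Figure 2 (a-c).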
Figure 2 demonstrates that features from GAN show less diversity and can be easily distinguished (shown in (c)), features from Ptest tend to confuse the classifier (shown in (b)), and features from our generator are more challenging for the classifier (shown in (a)). This shows the effectiveness of the proposed method in learning the worst-case distribution.

6 Related Work

To obtain robustness, a straightforward way is to train deep networks with the expected perturbations [41]. A mixture of expert classifiers, each trained on a different type of image perturbation, was proposed in [6] and shows more robustness than previous single-model methods [45]. To avoid the heavy parameter cost of ensembling many networks, additional layers [43] were introduced into the network; they act as undistortion layers and improve the robustness of the network by reconstructing input images. For long-tailed imbalanced training data, re-sampling and cost-sensitive methods are the two major strategies. Re-sampling includes over-sampling, which duplicates samples in rare classes, and under-sampling, which deletes samples from common classes. Over-sampling [16, 28] is limited by the repeated samples, which lead the network to overfit, while under-sampling [8] suffers from the information loss caused by sample deletion. Cost-sensitive methods assign different weights to samples when calculating the loss function: some methods assign weights according to the class frequency [19, 32], and others assign weights to samples based on how difficult they are for the network to resolve [29, 7], which is somewhat similar to the proposed method. Reduced training data is also a challenging setting in classification. Some data augmentation algorithms have been proposed to relieve the shortage of training data, such as DAGAN [2] and Smart Augmentation [26]. DAGAN also introduces a GAN [14] to generate samples and train the classifier.
We differ from these methods in that they do not consider the classifier in the process of generating samples as we do. In addition, adversarial data augmentation usually aims to create adversarial copies of training data by adding perturbations; in contrast, we generate new samples from a distribution. Moreover, the amount of perturbation (in pixels for images) is often constrained in classical methods, while we operate over a distribution ball with a bounded radius.

7 Conclusion

We propose a new adversarial classification algorithm that improves the performance of the classifier when a gap exists between the unknown true distribution and the known empirical distribution. By dynamically interacting with the classifier and the known data distribution, a worst-case distribution is learned to help the training of the classifier, in contrast to existing robust algorithms designed for one specific data defect. Both theoretical analysis and experimental results show that the proposed method can effectively improve the generalization ability of classifiers on bad data sets.

Acknowledgment

We thank the anonymous area chair and reviewers for their helpful comments. This research was supported in part by National Natural Science Foundation of China under Grants No. 61876007 and 61872012, and Australian Research Council Grant DE-180101438.

References

[1] S. Ando and C. Y. Huang. Deep over-sampling framework for classifying imbalanced data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 770–785. Springer, 2017.

[2] A. Antoniou, A. Storkey, and H. Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.

[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.

[4] A. Brock, J.
Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[5] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. In Proc. of Computer Vision and Pattern Recognition, pages 9268–9277, 2019.

[6] S. F. Dodge and L. J. Karam. Quality robust mixtures of deep neural networks. IEEE Transactions on Image Processing, 27(11):5553–5562, 2018.

[7] Q. Dong, S. Gong, and X. Zhu. Class rectification hard mining for imbalanced deep learning. In Proc. of International Conference on Computer Vision, pages 1851–1860, 2017.

[8] C. Drummond, R. C. Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11, pages 1–8. Citeseer, 2003.

[9] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[10] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. In Advances in neural information processing systems, pages 5247–5256, 2017.

[11] I. Ginodi and A. Globerson. Gaussian robust classification. CoRR, abs/1104.0235, 2011.

[12] R. Girshick. Fast r-cnn. In Proc. of International Conference on Computer Vision, 2015.

[13] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.

[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[15] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans.
In Advances in neural information processing systems, pages 5767–5777, 2017.

[16] H. Han, W.-Y. Wang, and B.-H. Mao. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer, 2005.

[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proc. of International Conference on Computer Vision, pages 2980–2988, 2017.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of Computer Vision and Pattern Recognition, 2016.

[19] C. Huang, Y. Li, C. Change Loy, and X. Tang. Learning deep representation for imbalanced classification. In Proc. of Computer Vision and Pattern Recognition, pages 5375–5384, 2016.

[20] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proc. of Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[22] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[24] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957, 1992.

[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[26] J. Lemley, S. Bazrafkan, and P. Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858–5869, 2017.

[27] C. Li, T. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets.
In Advances in neural information processing systems, pages 4088–4098, 2017.

[28] M. Lin, K. Tang, and X. Yao. Dynamic sampling approach to training neural networks for multiclass imbalance classification. IEEE Transactions on Neural Networks and Learning Systems, 24(4):647–660, 2013.

[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proc. of International Conference on Computer Vision, pages 2980–2988, 2017.

[30] C. Ling and V. Sheng. Cost-sensitive learning and the class imbalance problem. Encyclopedia of Machine Learning. Springer, 2011.

[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. of Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[32] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[34] C. J. Ng and A. B. J. Teoh. Dctnet: A simple learning-free approach for face recognition. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 761–768. IEEE, 2015.

[35] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, 2015.

[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge.
International Journal of Computer Vision, 115(3):211–252, 2015.

[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[39] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.

[40] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, 2016.

[41] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich. Examining the impact of blur on recognition by convolutional networks. arXiv preprint arXiv:1611.05760, 2016.

[42] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.

[43] J. Yim and K.-A. Sohn. Enhancing the performance of convolutional neural networks on quality degraded datasets. In International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–8. IEEE, 2017.

[44] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

[45] Y. Zhou, S. Song, and N.-M. Cheung. On classification of distorted images with deep convolutional neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1213–1217.
IEEE, 2017.