{"title": "A Bayesian Data Augmentation Approach for Learning Deep Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2797, "page_last": 2806, "abstract": "Data augmentation is an essential part of the training process applied to deep learning models.  The motivation is that a robust training process for deep learning models depends on large annotated datasets, which are expensive to be acquired, stored and processed.  Therefore a reasonable alternative is to be able to automatically generate new annotated training samples using a process known as data augmentation. The dominant data augmentation approach in the field assumes that new training samples can be obtained via random geometric or appearance transformations applied to annotated training samples, but this is a strong assumption because it is unclear if this is a reliable generative model for producing new training samples. In this paper, we provide a novel Bayesian formulation to data augmentation, where new annotated training points are treated as missing variables and generated based on the distribution learned from the training set. For learning, we introduce a theoretically sound algorithm --- generalised Monte Carlo expectation maximisation, and demonstrate one possible implementation via an extension of the Generative Adversarial Network (GAN). 
Classification results on MNIST, CIFAR-10 and CIFAR-100 show the better performance of our proposed method compared to the current dominant data augmentation approach mentioned above --- the results also show that our approach produces better classification results than similar GAN models.", "full_text": "A Bayesian Data Augmentation Approach for Learning Deep Models\n\nToan Tran1, Trung Pham1, Gustavo Carneiro1, Lyle Palmer2 and Ian Reid1\n\n1School of Computer Science, 2School of Public Health\n\nThe University of Adelaide, Australia\n\n{toan.m.tran, trung.pham, gustavo.carneiro, lyle.palmer, ian.reid}@adelaide.edu.au\n\nAbstract\n\nData augmentation is an essential part of the training process applied to deep\nlearning models. The motivation is that a robust training process for deep learning\nmodels depends on large annotated datasets, which are expensive to acquire,\nstore and process. Therefore, a reasonable alternative is to automatically\ngenerate new annotated training samples using a process known as data\naugmentation. The dominant data augmentation approach in the field assumes\nthat new training samples can be obtained via random geometric or appearance\ntransformations applied to annotated training samples, but this is a strong assumption\nbecause it is unclear if this is a reliable generative model for producing new\ntraining samples. In this paper, we provide a novel Bayesian formulation to data\naugmentation, where new annotated training points are treated as missing variables\nand generated based on the distribution learned from the training set. For learning,\nwe introduce a theoretically sound algorithm, generalised Monte Carlo expectation\nmaximisation, and demonstrate one possible implementation via an extension\nof the Generative Adversarial Network (GAN). 
Classification results on MNIST,\nCIFAR-10 and CIFAR-100 show the better performance of our proposed method\ncompared to the current dominant data augmentation approach mentioned above;\nthe results also show that our approach produces better classification results than\nsimilar GAN models.\n\n1 Introduction\n\nDeep learning has become the \u201cbackbone\u201d of several state-of-the-art visual object classification\n[19, 14, 25, 27], speech recognition [17, 12, 6], and natural language processing [4, 5, 31] systems.\nOne of the many reasons that explain the success of deep learning models is that their large capacity\nallows for the modeling of complex, high dimensional data patterns. The large capacity allowed by\ndeep learning is enabled by millions of parameters estimated within annotated training sets, where\ngeneralization tends to improve with the size of these training sets. One way of acquiring large\nannotated training sets is via the manual (or \u201chand\u201d) labeling of training samples by human experts,\na difficult and sometimes subjective task that is expensive and prone to mistakes. Another way of\nproducing such large training sets is to artificially enlarge existing training datasets, a process that\nis commonly known in computer science as data augmentation (DA).\nIn computer vision applications, DA has been predominantly developed with the application of simple\ngeometric and appearance transformations on existing annotated training samples in order to generate\nnew training samples, where the transformation parameters are sampled with additive Gaussian or\nuniform noise. For instance, for ImageNet classification [8], new training images can be generated by\napplying random rotations, translations or color perturbations to the annotated images [19]. 
Such a\nDA process based on \u201clabel-preserving\u201d transformations assumes that the noise model over these\ntransformation spaces can represent with fidelity the processes that have produced the labelled images.\nThis is a strong assumption that, to the best of our knowledge, has not been properly tested. In fact,\nthis commonly used DA process is known as \u201cpoor man\u2019s\u201d data augmentation (PMDA) [28] in the\nstatistical learning community because new synthetic samples are generated from a distribution\nestimated only once at the beginning of the training process.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: An overview of our Bayesian data augmentation algorithm for learning deep models. In\nthis analytic framework, the generator and classifier networks are jointly learned, and the synthesized\ntraining set is continuously updated as the training progresses.\n\nIn the current manuscript, we propose a novel Bayesian DA approach for training deep learning\nmodels. In particular, we treat synthetic data points as instances of a random latent variable, which\nare drawn from a distribution learned from the given annotated training set. Effectively, rather than\ngenerating new synthetic training data prior to the training process using pre-defined transformation\nspaces and noise models, our approach generates new training data as the training progresses using\nsamples obtained from an iteratively learned training data distribution. Fig. 1 shows an overview of\nour proposed data augmentation algorithm.\nThe development of our approach is inspired by DA using latent variables proposed by the statistical\nlearning community [29], where the motivation is to introduce latent variables to facilitate the\ncomputation of posterior distributions. 
However, directly applying this idea to deep learning is challenging\nbecause sampling millions of network parameters is computationally difficult. By replacing the\nestimation of the posterior distribution by the estimation of the maximum a posteriori (MAP)\nprobability, one can employ the Expectation Maximization (EM) algorithm, if the maximisation of such\naugmented posteriors is feasible. Unfortunately, this is not the case for deep learning models, where\nthe posterior maximisation cannot reliably produce a global optimum. An additional challenge for\ndeep learning models is that it is nontrivial to compute the expected value of the network parameters\ngiven the current estimate of the network parameters and the augmented data.\nIn order to address such challenges, we propose a novel Bayesian DA algorithm, called Generalized\nMonte Carlo Expectation Maximization (GMCEM), which jointly augments the training data and\noptimises the network parameters. Our algorithm runs iteratively, where at each iteration we sample\nnew synthetic training points and use Monte Carlo to estimate the expected value of the network\nparameters given the previous estimate. Then, the parameter values are updated with stochastic\ngradient descent (SGD). We show that the augmented learning loss function is actually equivalent to\nthe expected value of the network parameters, and that therefore we can guarantee weak convergence.\nMoreover, our method depends on the definition of predictive distributions over the latent variables,\nbut the design of such distributions is hard because they need to be sufficiently expressive to model\nhigh-dimensional data, such as images. 
We address this challenge by leveraging the recent advances\nreached by deep generative models [11], where data distributions are implicitly represented via deep\nneural networks whose parameters are learned from annotated data.\nWe demonstrate our Bayesian DA algorithm in the training of deep learning classification models [15,\n16]. Our proposed algorithm is realised by extending a generative adversarial network (GAN)\nmodel [11, 22, 24] with a data generation model and two discriminative models (one to discriminate\nbetween real and fake images and another to discriminate between the dataset classes). One important\ncontribution of our approach is the fact that the modularity of our method allows us to test different\nmodels for the generative and discriminative models \u2013 in particular, we are able to test several recently\nproposed deep learning models [15, 16] for the dataset class classification. Experiments on MNIST,\nCIFAR-10 and CIFAR-100 datasets show the better classification performance of our proposed\nmethod compared to the current dominant DA approach.\n\n2 Related Work\n\n2.1 Data Augmentation\n\nData augmentation (DA) has become an essential step in training deep learning models, where\nthe goal is to enlarge the training sets to avoid over-fitting. DA has also been explored by the\nstatistical learning community [29, 7] for calculating posterior distributions via the introduction of\nlatent variables. 
Such DA techniques are useful in cases where the likelihood (or posterior) density\nfunctions are hard to maximize or sample, but the augmented density functions are easier to work with.\nAn important caveat is that in statistical learning, latent variables may not lie in the same space as the\nobserved data, but in deep learning, the latent variables representing the synthesized training samples\nbelong to the same space as the observed data.\nSynthesizing new training samples from the original training samples is a widely used DA method\nfor training deep learning models [30, 26, 19]. The usual idea is to apply either additive Gaussian or\nuniform noise over pre-determined families of transformations to generate new synthetic training\nsamples from the original annotated training samples. For example, Yaeger et al. [30] proposed the\n\u201cstroke warping\u201d technique for word recognition, which adds small changes in skew, rotation, and\nscaling into the original word images. Simard et al. [26] used a related approach for visual document\nanalysis. Similarly, Krizhevsky et al. [19] used horizontal reflections and color perturbations for\nimage classification. Hauberg et al. [13] proposed a manifold learning approach that is run once\nbefore the classifier training begins, where this manifold describes the geometric transformations\npresent in the training set.\nNevertheless, the DA approaches presented above have several limitations. First, it is unclear how\nto generate diverse data samples. As pointed out by Fawzi et al. [10], the transformations should\nbe \u201csufficiently small\u201d so that the ground truth labels are preserved. In other words, these methods\nimplicitly assume a small scale noise model over a pre-determined \u201ctransformation space\u201d of the\ntraining samples. 
Such an assumption is likely too restrictive and has not been tested properly.\nMoreover, these DA mechanisms do not adapt with the progress of the learning process\u2014 instead, the\naugmented data are generated only once and prior to the training process. This is, in fact, analogous to\nthe Poor Man\u2019s Data Augmentation (PMDA) [28] algorithm in statistical learning as it is non-iterative.\nIn contrast, our Bayesian DA algorithm iteratively generates novel training samples as the training\nprogresses, and the \u201cgenerator\u201d is adaptively learned. This is crucial because we do not make a noise\nmodel assumption over pre-determined transformation spaces to generate new synthetic training\nsamples.\n\n2.2 Deep Generative Models\n\nDeep learning has been widely applied in training discriminative models with great success, but\nthe progress in learning generative models has proven to be more dif\ufb01cult. One noteworthy work\nin training deep generative models is the Generative Adversarial Networks (GAN) proposed by\nGoodfellow et al. [11], which, once trained, can be used to sample synthetic images. GAN consists\nof one generator and one discriminator, both represented by deep learning models. In \u201cadversarial\ntraining\u201d, the generator and discriminator play a \u201ctwo-player minimax game\u201d, in which the generator\ntries to fool the discriminator by rendering images as similar as possible to the real images, and the\ndiscriminator tries to distinguish the real and fake ones. Nonetheless, the synthetic images generated\nby GAN are of low quality when trained on the datasets with high variability [9]. Variants of GAN\nhave been proposed to improve the quality of the synthetic images [22, 3, 23, 24]. For instance,\nconditional GAN [22] improves the original GAN by making the generator conditioned on the class\nlabels. 
Auxiliary classifier GAN (AC-GAN) [24] additionally forces the discriminator to classify both\nreal-or-fake sources as well as the class labels of the input samples. These two works have shown\nsignificant improvement over the original GAN in generating photo-realistic images. So far these\ngenerative models mainly aim at generating samples of high-quality, high-resolution photo-realistic\nimages. In contrast, we explore generative models (in the form of GANs) in our proposed Bayesian\nDA algorithm for improving classification models.\n\n3 Data Augmentation Algorithm in Deep Learning\n\n3.1 Bayesian Neural Networks\n\nOur goal is to estimate the parameters of a deep learning model using an annotated training set\ndenoted by Y = {y_n}_{n=1}^{N}, where y = (t, x), with annotations t \u2208 {1, ..., K} (K = # Classes), and\ndata samples represented by x \u2208 R^D. Denoting the model parameters by \u03b8, the training process is\ndefined by the following optimisation problem:\n\n\u03b8\u2217 = arg max_\u03b8 log p(\u03b8|y),    (1)\n\nwhere the observed posterior p(\u03b8|y) = p(\u03b8|t, x) \u221d p(t|x, \u03b8)p(x|\u03b8)p(\u03b8).\nAssuming that the data samples in Y are conditionally independent, the cost function that maximises\n(1) is defined as [1]:\n\nlog p(\u03b8|y) \u2248 log p(\u03b8) + (1/N) \u2211_{n=1}^{N} (log p(t_n|x_n, \u03b8) + log p(x_n|\u03b8)),    (2)\n\nwhere p(\u03b8) denotes a prior on the distribution of the deep learning model parameters, p(t_n|x_n, \u03b8)\nrepresents the conditional likelihood of label t_n, and p(x_n|\u03b8) is the likelihood of the data x_n.\nIn general, the training process to estimate the model parameters \u03b8 tends to over-fit the training set Y\ngiven the large dimensionality of \u03b8 and the fact that Y does not have a sufficiently large amount of\ntraining samples. 
One of the main approaches designed to circumvent this over-fitting issue is the\nautomated generation of synthetic training samples, a process known as data augmentation (DA).\nIn this work, we propose a novel Bayesian approach to augment the training set, targeting a more\nrobust training process.\n\n3.2 Data Augmentation using Latent Variable Methods\n\nThe DA principle is to increase the observed training data y using a latent variable z that represents\nthe synthesised data, so that the augmented posterior p(\u03b8|y, z) can be easily estimated [28], leading\nto a more robust estimation of p(\u03b8|y). The latent variable is defined by z = (t^a, x^a), where x^a \u2208 R^D\nrefers to a synthesized data point, and t^a \u2208 {1, ..., K} denotes the associated label.\nThe most commonly chosen optimization method in these types of training processes involving\na latent variable is the expectation-maximisation (EM) algorithm [7]. In EM, let \u03b8_i denote the\nestimated parameters of the model of p(\u03b8|y) at iteration i, and let p(z|\u03b8_i, y) represent the conditional\npredictive distribution of z. Then, the E-step computes the expectation of log p(\u03b8|y, z) with respect\nto p(z|\u03b8_i, y), as follows:\n\nQ(\u03b8, \u03b8_i) = E_{p(z|\u03b8_i,y)}[log p(\u03b8|y, z)] = \u222b_z log p(\u03b8|y, z) p(z|\u03b8_i, y) dz.    (3)\n\nThe parameter estimation at the next iteration, \u03b8_{i+1}, is then obtained at the M-step by maximizing\nthe Q function:\n\n\u03b8_{i+1} = arg max_\u03b8 Q(\u03b8, \u03b8_i).    (4)\n\nThe algorithm iterates until ||\u03b8_{i+1} \u2212 \u03b8_i|| is sufficiently small, and the optimal \u03b8\u2217 is selected from the\nlast iteration. The EM algorithm guarantees that the sequence {\u03b8_i}_{i=1,2,...} converges to a stationary\npoint of p(\u03b8|y) [7, 28], given that the expectation in (3) and the maximization in (4) can be computed\nexactly. 
In the convergence proof [7, 28], it is assumed that \u03b8_i converges to \u03b8\u2217 as the number of\niterations i increases; the proof then consists of showing that \u03b8\u2217 is a critical point of p(\u03b8|y).\nHowever, in practice, either the E-step or M-step or both can be difficult to compute exactly, especially\nwhen working with deep learning models. In such cases, we need to rely on approximation methods.\nFor instance, the Monte Carlo sampling method can approximate the integration in (3) (the E-step).\nThis technique is known as the Monte Carlo EM (MCEM) algorithm [28]. Furthermore, when the\nestimation of the global maximiser of Q(\u03b8, \u03b8_i) in (4) is difficult, Dempster et al. [7] proposed the\nGeneralized EM (GEM) algorithm, which relaxes this requirement with the estimation of \u03b8_{i+1}, where\nQ(\u03b8_{i+1}, \u03b8_i) > Q(\u03b8_i, \u03b8_i). The GEM algorithm is proven to have weak convergence [28], by showing\nthat p(\u03b8_{i+1}|y) > p(\u03b8_i|y), given that Q(\u03b8_{i+1}, \u03b8_i) > Q(\u03b8_i, \u03b8_i).\n\n3.3 Generalized Monte Carlo EM Algorithm\nWith the latent variable z, the augmented posterior p(\u03b8|y, z) becomes:\n\np(\u03b8|y, z) = p(y, z, \u03b8) / p(y, z) = [p(z|y, \u03b8) p(\u03b8|y) p(y)] / [p(z|y) p(y)] = p(z|y, \u03b8) p(\u03b8|y) / p(z|y),    (5)\n\nwhere the E-step is represented by the following Monte-Carlo estimation of Q(\u03b8, \u03b8_i):\n\n\u02c6Q(\u03b8, \u03b8_i) = (1/M) \u2211_{m=1}^{M} log p(\u03b8|y, z_m) = log p(\u03b8|y) + (1/M) \u2211_{m=1}^{M} (log p(z_m|y, \u03b8) \u2212 log p(z_m|y)),    (6)\n\nwhere z_m \u223c p(z|y, \u03b8_i), for m \u2208 {1, ..., M}. In (6), if the label t^a_m of the m-th synthesized sample\nz_m is known, then x^a_m can be sampled from the distribution p(x^a_m|\u03b8, y, t^a_m). 
Hence, the conditional\ndistribution p(z|y, \u03b8) can be decomposed as:\n\np(z|y, \u03b8) = p(t^a, x^a|y, \u03b8) = p(t^a|x^a, y, \u03b8) p(x^a|y, \u03b8),    (7)\n\nwhere (t^a, x^a) are conditionally independent of y given that all the information from the training set\ny is summarized in \u03b8; this means that p(t^a|x^a, y, \u03b8) = p(t^a|x^a, \u03b8), and p(x^a|y, \u03b8) = p(x^a|\u03b8).\nThe maximization of \u02c6Q(\u03b8, \u03b8_i) with respect to \u03b8 for the M-step is re-formulated by first removing all\nterms that are independent of \u03b8, which allows us to reach the following derivation (making the same\nassumption as in (2)):\n\n\u02c6Q(\u03b8, \u03b8_i) = log p(\u03b8) + (1/N) \u2211_{n=1}^{N} (log p(t_n|x_n, \u03b8) + log p(x_n|\u03b8)) + (1/M) \u2211_{m=1}^{M} log p(z_m|y, \u03b8)\n\n= log p(\u03b8) + (1/N) \u2211_{n=1}^{N} (log p(t_n|x_n, \u03b8) + log p(x_n|\u03b8)) + (1/M) \u2211_{m=1}^{M} (log p(t^a_m|x^a_m, \u03b8) + log p(x^a_m|\u03b8)).    (8)\n\nGiven that there is no analytical solution for the optimization in (8), we follow the same strategy\nemployed in the GEM algorithm, where we estimate \u03b8_{i+1} so that \u02c6Q(\u03b8_{i+1}, \u03b8_i) > \u02c6Q(\u03b8_i, \u03b8_i).\nAs the function \u02c6Q(\u00b7, \u03b8_i) is differentiable, we can find such \u03b8_{i+1} by running one step of gradient\ndescent. It can be seen that our proposed optimization consists of a marriage between MCEM and\nGEM algorithms, which we name: Generalized Monte Carlo EM (GMCEM). The weak convergence\nproof of GMCEM is provided by Lemma 1.\nLemma 1. Assuming that \u02c6Q(\u03b8_{i+1}, \u03b8_i) > \u02c6Q(\u03b8_i, \u03b8_i), which is guaranteed from (8), then the weak\nconvergence (i.e. p(\u03b8_{i+1}|y) > p(\u03b8_i|y)) will be fulfilled.\n\nProof. 
Given \u02c6Q(\u03b8_{i+1}, \u03b8_i) > \u02c6Q(\u03b8_i, \u03b8_i), then by taking the expectation on both sides, that is\nE_{p(z|y,\u03b8_i)}[\u02c6Q(\u03b8_{i+1}, \u03b8_i)] > E_{p(z|y,\u03b8_i)}[\u02c6Q(\u03b8_i, \u03b8_i)], we obtain Q(\u03b8_{i+1}, \u03b8_i) > Q(\u03b8_i, \u03b8_i), which is the\ncondition for p(\u03b8_{i+1}|y) > p(\u03b8_i|y) proven in [28].\n\nSo far, we have presented our Bayesian DA algorithm in a very general manner. The specific forms\nthat the probability terms in (8) take in our implementation are presented in the next section.\n\n4 Implementation\n\nIn general, our proposed DA algorithm can be implemented using any deep generative and classification\nmodels which have differentiable optimisation functions. This is in fact an important advantage\nthat allows us to use the most sophisticated extant models available in the field for the implementation\nof our algorithm. In this section, we present a specific implementation of our approach using\nstate-of-the-art discriminative and generative models.\n\n4.1 Network Architecture\n\nOur network architecture consists of two models: a classifier and a generator. For the classifier,\nmodern deep convolutional neural networks [15, 16] can be used. For the generator, we select\nthe generative adversarial networks (GAN) [11], which include a generative model (represented\nby a deconvolutional neural network) and an authenticator model (represented by a convolutional\nneural network). This authenticator component is mainly used for facilitating the adversarial\ntraining. As a result, our network consists of a classifier (C) with parameters \u03b8_C, a generator (G)\nwith parameters \u03b8_G and an Authenticator (A) with parameters \u03b8_A. Fig. 2 compares our network\narchitecture with other variants of GAN recently proposed [11, 22, 24]. 
On the surface, our network\nappears similar to AC-GAN [24], where the only difference is the separation of the classifier network\nfrom the authenticator network. However, this crucial modularisation enables our DA algorithm\nto replace GANs by other generative models that may become available in the future; likewise,\nwe can use the most sophisticated classification models for C. Furthermore, unlike our model,\nthe classification subnetwork introduced in AC-GAN mainly aims at improving the quality of\nsynthesized samples, rather than at classification tasks. Nonetheless, one can consider AC-GAN\nas one possible implementation of our DA algorithm. Finally, our proposed GAN model is similar\nto the recently proposed triplet GAN [21] (see footnote 1), but it is important to emphasise that triplet GAN was\nproposed in order to improve the training procedure for GANs, while our model represents a particular\nrealisation of the proposed Bayesian DA algorithm, which is the main contribution of this paper.\n\nFigure 2: A comparison of different network architectures including GAN [11], C-GAN [22],\nAC-GAN [24] and ours. 
G: Generator, A: Authenticator, C: Classifier, D: Discriminator.\n\n4.2 Optimization Function\nLet us define x \u2208 R^D, \u03b8_C \u2208 R^C, \u03b8_A \u2208 R^A, \u03b8_G \u2208 R^G, u \u2208 R^100, c \u2208 {1, ..., K}. The classifier C, the\nauthenticator A and the generator G are respectively defined by\n\nf_C : R^D \u00d7 R^C \u2192 [0, 1]^K;    (9)\nf_A : R^D \u00d7 R^A \u2192 [0, 1]^2;    (10)\nf_G : R^100 \u00d7 Z+ \u00d7 R^G \u2192 R^D.    (11)\n\nThe optimisation function used to train the classifier C is defined as:\n\nJ_C(\u03b8_C) = (1/N) \u2211_{n=1}^{N} l_C(t_n|x_n, \u03b8_C) + (1/M) \u2211_{m=1}^{M} l_C(t^a_m|x^a_m, \u03b8_C),    (12)\n\nwhere l_C(t_n|x_n, \u03b8_C) = \u2212 log (softmax(f_C(t_n = c; x_n, \u03b8_C))).\nThe optimisation functions for the authenticator and generator networks are defined by [11]:\n\nJ_AG(\u03b8_A, \u03b8_G) = (1/N) \u2211_{n=1}^{N} l_A(x_n|\u03b8_A) + (1/M) \u2211_{m=1}^{M} l_AG(x^a_m|\u03b8_A, \u03b8_G),    (13)\n\nwhere\n\nl_A(x_n|\u03b8_A) = \u2212 log (softmax(f_A(input = real, x_n, \u03b8_A)));    (14)\nl_AG(x^a_m|\u03b8_A, \u03b8_G) = \u2212 log (1 \u2212 softmax(f_A(input = real, x^a_m, \u03b8_G, \u03b8_A))).    (15)\n\nFollowing the same training procedure used to train GANs [11, 24], the optimisation is divided into\ntwo steps: the training of the discriminative part, consisting of minimising J_C(\u03b8_C) + J_AG(\u03b8_A, \u03b8_G),\nand the training of the generative part, consisting of minimising J_C(\u03b8_C) \u2212 J_AG(\u03b8_A, \u03b8_G). 
This loss\nfunction can be linked to (8), as follows:\n\nl_C(t_n|x_n, \u03b8_C) = \u2212 log p(t_n|x_n, \u03b8),    (16)\nl_C(t^a_m|x^a_m, \u03b8_C) = \u2212 log p(t^a_m|x^a_m, \u03b8),    (17)\nl_A(x_n|\u03b8_A) = \u2212 log p(x_n|\u03b8),    (18)\nl_AG(x^a_m|\u03b8_A, \u03b8_G) = \u2212 log p(x^a_m|\u03b8).    (19)\n\n1 The triplet GAN [21] was proposed in parallel to this NIPS submission.\n\n4.3 Training\n\nTraining the network parameters \u03b8 follows the proposed GMCEM algorithm presented in Sec. 3.\nAccordingly, at each iteration we need to find \u03b8_{i+1} so that \u02c6Q(\u03b8_{i+1}, \u03b8_i) > \u02c6Q(\u03b8_i, \u03b8_i), which can be\nachieved using gradient descent. However, since the number of training and augmented samples\n(i.e., N + M) is large, evaluating the sum of the gradients over this whole set is computationally\nexpensive. A similar issue was observed in contrastive divergence [2], where the computation of the\napproximate gradient required in theory an infinite number of Markov chain Monte Carlo (MCMC)\ncycles, but in practice, it was noted that only a few cycles were needed to provide a robust gradient\napproximation. Analogously, following the same principle, we propose to replace gradient descent by\nstochastic gradient descent (SGD), where the update from \u03b8_i to \u03b8_{i+1} is estimated using only a subset\nof the M + N training samples. In practice, we divide the training set into batches, and the updated\n\u03b8_{i+1} is obtained by running SGD through all batches (i.e., one epoch). We found that such a strategy\nworks well empirically, as shown in the experiments (Sec. 5).\n\n5 Experiments\n\nIn this section, we compare our proposed Bayesian DA algorithm with the commonly used DA\ntechnique [19] (denoted as PMDA) on several image classification tasks (code available at:\nhttps://github.com/toantm/keras-bda). 
This comparison is based on experiments using the\nfollowing three datasets: MNIST [20] (containing 60,000 training and 10,000 testing images of 10\nhandwritten digits), CIFAR-10 [18] (consisting of 50,000 training and 10,000 testing images of 10\nvisual classes like car, dog, cat, etc.), and CIFAR-100 [18] (containing the same amount of training\nand testing samples as CIFAR-10, but with 100 visual classes).\nThe experimental results are based on the top-1 classification accuracy as a function of the amount of\ndata augmentation used \u2013 in particular, we try the following amounts of synthesized images M:\nM = N (i.e., 2\u00d7 DA), M = 4N (5\u00d7 DA), and M = 9N (10\u00d7 DA). The PMDA is based on the\nuse of a uniform noise model over a rotation range of [\u221210, 10] degrees, and a translation range of at\nmost 10% of the image width and height. Other transformations were tested, but these two provided\nthe best results for PMDA on the datasets considered in this paper. We also include an experiment\nthat does not use DA in order to illustrate the importance of DA in deep learning.\nAs mentioned in Sec. 1, one important contribution of our method is its ability to use arbitrary deep\nlearning generative and classification models. For the generative model, we use the C-GAN [22] (see footnote 2), and\nfor the classification model we rely on the ResNet18 [15] and ResNetpa [16]. The architectures of the\ngenerator and authenticator networks, which are kept unchanged for all three datasets, can be found\nin the supplementary material. For training, we use Adadelta (with learning rate=1.0, decay rate=0.95\nand epsilon=1e\u22128) for the Classifier (C), Adam (with learning rate 0.0002, and exponential decay\nrate 0.5) for the Generator (G) and SGD (with learning rate 0.01) for the Authenticator (A). The\nnoise vector used by the Generator G is based on a standard Gaussian noise. 
In all experiments, we\nuse training batches of size 100.\nComparison results using ResNet18 and ResNetpa networks are shown in Figures 3 and 4. First, in all\ncases it is clear that DA provides a significant improvement in the classification accuracy \u2013 in general,\nlarger augmented training set sizes lead to more accurate classification. More importantly, the results\nreveal that our Bayesian DA algorithm outperforms PMDA by a large margin in all datasets. Given\nthe similarity between the model used by our proposed Bayesian DA algorithm (using ResNetpa [16])\nand AC-GAN, it is relevant to present a comparison between these two models, which is shown in\nFig. 5 \u2013 notice that our approach is far superior to AC-GAN. Finally, it is also important to show the\nevolution of the test classification accuracy as a function of training time \u2013 this is reported in Fig. 6.\nAs expected, it is clear that PMDA produces better classification results at the first training stages, but\nafter a certain amount of training, our Bayesian DA algorithm produces better results. In particular,\nusing the ResNet18 [15] classifier, on CIFAR-100, our method is better than PMDA after two hours\nof training; while for MNIST, our method is better after five hours of training.\n\nFigure 3: Performance comparison using ResNet18 [15] classifier. Panels: (a) MNIST, (b) CIFAR-10, (c) CIFAR-100.\n\nFigure 4: Performance comparison using ResNetpa [16] classifier. Panels: (a) MNIST, (b) CIFAR-10, (c) CIFAR-100.\n\n2 The code was adapted from: https://github.com/lukedeo/keras-acgan\n\nIt is worth emphasizing that the main goal of the proposed Bayesian DA is to improve the training\nprocess of the classifier C. Nevertheless, it is also of interest to investigate the quality of the\nimages produced by the generator G. In Fig. 7, we display several examples of the synthetic images\nproduced by G after the training process has converged. 
In general, the images look reasonably\nrealistic, particularly the handwritten digits, where the synthesized images would be hard to generate\nby the application of Gaussian or uniform noise on pre-determined geometric and appearance\ntransformations.\n\n[Plots for Figures 3 and 4: accuracy rate versus increase in size of training data (2X, 5X, 10X) for ResNet18 and ResNetPA on MNIST, CIFAR-10 and CIFAR-100; curves: Without DA, PMDA, Ours.]\n\nFigure 5: Performance comparison with AC-GAN using ResNetpa [16]. Panels: (a) MNIST, (b) CIFAR-10, (c) CIFAR-100.\n[Plots: accuracy rate versus increase in size of training data (2X, 5X, 10X); curves: AC-GAN, ResNetpa without DA, ResNetpa with ours.]\n\nFigure 6: Classification accuracy (as a function of the training time) using PMDA and our proposed\ndata augmentation on ResNet18 [15]. Panels: (a) MNIST, (b) CIFAR-100.\n[Plots: accuracy rate versus training time (0.1hr, 1hr, 2hrs, 5hrs, 10hrs, 24hrs); curves: With PMDA, With ours.]\n\nFigure 7: Synthesized images generated using our model trained on MNIST (a), CIFAR-10 (b) and\nCIFAR-100 (c). 
Each column is conditioned on a class label: a) classes are 0, ..., 9; b) classes are\nairplane, automobile, bird and ship; and c) classes are apple, aquarium fish, rose and lobster.\n\n6 Conclusions\n\nIn this paper we have presented a novel Bayesian DA that improves the training process of deep\nlearning classification models. Unlike currently dominant methods that apply random transformations\nto the observed training samples, our method is theoretically sound; the missing data are sampled\nfrom the distribution learned from the annotated training set. However, we do not train the generator\ndistribution independently from the training of the classification model. Instead, both models are\njointly optimised based on our proposed Bayesian DA formulation that connects the classical latent\nvariable method in statistical learning with modern deep generative models. The advantages of\nour data augmentation approach are validated using several image classification tasks with clear\nimprovements over standard DA methods and also over the recently proposed AC-GAN model [24].\n\nAcknowledgments\n\nTT gratefully acknowledges the support by Vietnam International Education Development (VIED).\nTP, GC and IR gratefully acknowledge the support of the Australian Research Council through the\nCentre of Excellence for Robotic Vision (project number CE140100016) and Laureate Fellowship\nFL130100102 to IR.\n\nReferences\n[1] C. Bishop. Pattern recognition and machine learning (information science and statistics), 1st edn. 2006.\ncorr. 2nd printing edn. Springer, New York, 2007.\n\n[2] M. A. Carreira-Perpinan and G. E. Hinton. 
On contrastive divergence learning. In AISTATS, volume 10, pages 33\u201340. Citeseer, 2005.\n[3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.\n[4] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160\u2013167. ACM, 2008.\n[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493\u20132537, 2011.\n[6] X. Cui, V. Goel, and B. Kingsbury. Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 23(9):1469\u20131477, 2015.\n[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1\u201338, 1977.\n[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.\n[9] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28, pages 1486\u20131494, 2015.\n[10] A. Fawzi, H. Samulowitz, D. Turaga, and P. Frossard. Adaptive data augmentation for image classification. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3688\u20133692. IEEE, 2016.\n[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. 
Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672\u20132680, 2014.\n[12] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645\u20136649. IEEE, 2013.\n[13] S. Hauberg, O. Freifeld, A. B. L. Larsen, J. Fisher, and L. Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Artificial Intelligence and Statistics, pages 342\u2013350, 2016.\n[14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904\u20131916, 2015.\n[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016.\n[16] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630\u2013645. Springer, 2016.\n[17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82\u201397, 2012.\n[18] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.\n[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[21] C. Li, K. Xu, J. 
Zhu, and B. Zhang. Triple generative adversarial nets. CoRR, abs/1703.02291, 2017.\n[22] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.\n[23] A. Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.\n[24] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.\n[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n[26] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2, 2003.\n[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.\n[28] M. A. Tanner. Tools for statistical inference: Observed data and data augmentation methods. Lecture Notes in Statistics, 67, 1991.\n[29] M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American statistical Association, 82(398):528\u2013540, 1987.\n[30] L. Yaeger, R. Lyon, and B. Webb. Effective training of a neural network character classifier for word recognition. In NIPS, volume 9, pages 807\u2013813, 1996.\n[31] X. Zhang and Y. LeCun. Text understanding from scratch. 
arXiv preprint arXiv:1502.01710, 2015.", "award": [], "sourceid": 1583, "authors": [{"given_name": "Toan", "family_name": "Tran", "institution": "The University of Adelaide"}, {"given_name": "Trung", "family_name": "Pham", "institution": "The University of Adelaide"}, {"given_name": "Gustavo", "family_name": "Carneiro", "institution": "The University of Adelaide"}, {"given_name": "Lyle", "family_name": "Palmer", "institution": "The University of Adelaide"}, {"given_name": "Ian", "family_name": "Reid", "institution": "University of Adelaide"}]}