{"title": "Learning to Compose Domain-Specific Transformations for Data Augmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 3236, "page_last": 3246, "abstract": "Data augmentation is a ubiquitous technique for increasing the size of labeled training sets by leveraging task-specific data transformations that preserve class labels. While it is often easy for domain experts to specify individual transformations, constructing and tuning the more sophisticated compositions typically needed to achieve state-of-the-art results is a time-consuming manual task in practice. We propose a method for automating this process by learning a generative sequence model over user-specified transformation functions using a generative adversarial approach. Our method can make use of arbitrary, non-deterministic transformation functions, is robust to misspecified user input, and is trained on unlabeled data. The learned transformation model can then be used to perform data augmentation for any end discriminative model. In our experiments, we show the efficacy of our approach on both image and text datasets, achieving improvements of 4.0 accuracy points on CIFAR-10, 1.4 F1 points on the ACE relation extraction task, and 3.4 accuracy points when using domain-specific transformation operations on a medical imaging dataset as compared to standard heuristic augmentation approaches.", "full_text": "Learning to Compose Domain-Speci\ufb01c\nTransformations for Data Augmentation\n\nAlexander J. Ratner\u2217, Henry R. 
Ehrenberg∗, Zeshan Hussain, Jared Dunnmon, Christopher Ré

Stanford University
{ajratner,henryre,zeshanmh,jdunnmon,chrismre}@cs.stanford.edu

Abstract

Data augmentation is a ubiquitous technique for increasing the size of labeled training sets by leveraging task-specific data transformations that preserve class labels. While it is often easy for domain experts to specify individual transformations, constructing and tuning the more sophisticated compositions typically needed to achieve state-of-the-art results is a time-consuming manual task in practice. We propose a method for automating this process by learning a generative sequence model over user-specified transformation functions using a generative adversarial approach. Our method can make use of arbitrary, non-deterministic transformation functions, is robust to misspecified user input, and is trained on unlabeled data. The learned transformation model can then be used to perform data augmentation for any end discriminative model. In our experiments, we show the efficacy of our approach on both image and text datasets, achieving improvements of 4.0 accuracy points on CIFAR-10, 1.4 F1 points on the ACE relation extraction task, and 3.4 accuracy points when using domain-specific transformation operations on a medical imaging dataset as compared to standard heuristic augmentation approaches.

1 Introduction

Modern machine learning models, such as deep neural networks, may have billions of free parameters and accordingly require massive labeled data sets for training. In most settings, labeled data is not available in sufficient quantities to avoid overfitting to the training set. 
The technique of artificially expanding labeled training sets by transforming data points in ways which preserve class labels – known as data augmentation – has quickly become a critical and effective tool for combating this labeled data scarcity problem. Data augmentation can be seen as a form of weak supervision, providing a way for practitioners to leverage their knowledge of invariances in a task or domain. And indeed, data augmentation is cited as essential to nearly every state-of-the-art result in image classification [3, 7, 11, 24] (see Supplemental Materials), and is becoming increasingly common in other modalities as well [20].

Even on well-studied benchmark tasks, however, the choice of data augmentation strategy is known to cause large variance in end performance and to be difficult to select [11, 7], with papers often reporting their heuristically found parameter ranges [3]. In practice, it is often simple to formulate a large set of primitive transformation operations, but time-consuming and difficult to find the parameterizations and compositions of them needed for state-of-the-art results. In particular, many transformation operations will have vastly different effects based on parameterization, the set of other transformations they are applied with, and even their particular order of composition. 
For example, brightness and saturation enhancements might be destructive when applied together, but produce realistic images when paired with geometric transformations.

∗Authors contributed equally

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Three examples of transformation functions (TFs) in different domains: two example sequences of incremental image TFs applied to CIFAR-10 images (left); a conditional word-swap TF using an externally trained language model and specifically targeting nouns (NN) between entity mentions (E1, E2) for a relation extraction task (middle); and an unsupervised segmentation-based translation TF applied to mass-containing mammography images (right).

Given the difficulty of searching over this configuration space, the de facto norm in practice consists of applying one or more transformations in random order and with random parameterizations selected from hand-tuned ranges. Recent lines of work attempt to automate data augmentation entirely, but either rely on large quantities of labeled data [1, 21], restricted sets of simple transformations [8, 13], or consider only local perturbations that are not informed by domain knowledge [1, 22] (see Section 4). In contrast, our aim is to directly and flexibly leverage domain experts' knowledge of invariances as a valuable form of weak supervision in real-world settings where labeled training data is limited.

In this paper, we present a new method for data augmentation that directly leverages user domain knowledge in the form of transformation operations, and automates the difficult process of composing and parameterizing them. 
We formulate the problem as one of learning a generative sequence model over black-box transformation functions (TFs): user-specified operators representing incremental transformations to data points that need not be differentiable nor deterministic. For example, TFs could rotate an image by a small degree, swap a word in a sentence, or translate a segmented structure in an image (Fig. 1). We then design a generative adversarial objective [9] which allows us to train the sequence model to produce transformed data points which are still within the data distribution of interest, using unlabeled data. Because the TFs can be stochastic or non-differentiable, we present a reinforcement learning-based training strategy for this model. The learned model can then be used to perform data augmentation on labeled training data for any end discriminative model.

Given the flexibility of our representation of the data augmentation process, we can apply our approach in many different domains, and on different modalities including both text and images. On a real-world mammography image task, we achieve a 3.4 accuracy point boost above randomly composed augmentation by learning to appropriately combine standard image TFs with domain-specific TFs derived in collaboration with radiology experts. Using novel language model-based TFs, we see a 1.4 F1 boost over heuristic augmentation on a text relation extraction task from the ACE corpus. And on a 10%-subsample of the CIFAR-10 dataset, we achieve a 4.0 accuracy point gain over a standard heuristic augmentation approach and are competitive with comparable semi-supervised approaches. Additionally, we show empirical results suggesting that the proposed approach is robust to misspecified TFs. 
Our hope is that the proposed method will be of practical value to practitioners and of interest to researchers, so we have open-sourced the code at https://github.com/HazyResearch/tanda.

2 Modeling Setup and Motivation

In the standard data augmentation setting, our aim is to expand a labeled training set by leveraging knowledge of class-preserving transformations. For a practitioner with domain expertise, providing individual transformations is straightforward. However, high performance augmentation techniques use compositions of finely tuned transformations to achieve state-of-the-art results [7, 3, 11], and heuristically searching over this space of all possible compositions and parameterizations for a new task is often infeasible. Our goal is to automate this task by learning to compose and parameterize a set of user-specified transformation operators in ways that are diverse but still preserve class labels.

In our method, transformations are modeled as sequences of incremental user-specified operations, called transformation functions (TFs) (Fig. 1). Rather than making the strong assumption that all the provided TFs preserve class labels, as existing approaches do, we assume a weaker form of class invariance which enables us to use unlabeled data to learn a generative model over transformation sequences.

Figure 2: A high-level diagram of our method. Users input a set of transformation functions h1, ..., hK and unlabeled data. A generative adversarial approach is then used to train a null class discriminator, D∅, and a generator, G, which produces TF sequences hτ1, ..., hτL. Finally, the trained generator is used to perform data augmentation for an end discriminative model Df. 
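Since TFs are treated as black-box, possibly stochastic functions, the interface they must satisfy is minimal: a map from a data point to an incrementally transformed data point. A small illustrative sketch of this interface and of sequential composition (the toy TF bodies and names below are hypothetical stand-ins, not operations from the released code):

```python
import random
from typing import Callable, List, Sequence

# A transformation function (TF) is a black-box, possibly stochastic map
# from a data point to an incrementally transformed data point.
TF = Callable[[List[float]], List[float]]

# Toy stand-ins for user-specified incremental TFs (illustrative only),
# mirroring e.g. a five-degree rotation or a small brightness shift.
tfs: List[TF] = [
    lambda x: [v + 0.1 for v in x],                    # small "brighten"
    lambda x: [v * 0.95 for v in x],                   # small "darken"
    lambda x: [v + random.gauss(0, 0.01) for v in x],  # stochastic jitter
]

def apply_sequence(x: List[float], taus: Sequence[int]) -> List[float]:
    """Apply the composition h_tau_L . ... . h_tau_1 to a data point x."""
    for t in taus:
        x = tfs[t](x)
    return x

# A length-L sequence of TF indices, as would be sampled from a generator:
x_aug = apply_sequence([0.5, 0.5], taus=[0, 1, 0])
```

Because each TF is only called, never differentiated, users can implement them in any library or scripting language of their choosing.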
We then propose two representative model classes to handle modeling both commutative and non-commutative transformations.

2.1 Augmentation as Sequence Modeling

In our approach, we represent transformations as sequences of incremental operations. In this setting, the user provides a set of K TFs, hi : X → X, i ∈ [1, K]. Each TF performs an incremental transformation: for example, hi could rotate an image by five degrees, swap a word in a sentence, or move a segmented tumor mass around a background mammography image (see Fig. 1). In order to accommodate a wide range of such user-defined TFs, we treat them as black-box functions which need not be deterministic nor differentiable.

This formulation gives us a tractable way to tune both the parameterization and composition of the TFs in a discretized but fine-grained manner. Our representation can be thought of as an implicit binning strategy for tuning parameterizations – e.g. a 15 degree rotation might be represented as three applications of a five-degree rotation TF. It also provides a direct way to represent compositions of multiple transformation operations. This is critical, as a multitude of state-of-the-art results in the literature show the importance of using compositions of more than one transformation per image [7, 3, 11], which we also confirm experimentally in Section 5.

2.2 Weakening the Class-Invariance Assumption

Any data augmentation technique fundamentally relies on some assumption about the transformation operations' relation to the class labels. Previous approaches make the unrealistic assumption that all provided transformation operations preserve class labels for all data points. That is,

y(hτL ∘ · · · ∘ hτ1(x)) = y(x)    (1)

for label mapping function y, any sequence of TF indices τ1, ..., τL, and all data points x.

This assumption puts a large burden of precise specification on the user and, based on our observations, is violated by many real-world data augmentation strategies. Instead, we consider a weaker modeling assumption. We assume that transformation operations will not map between classes, but might destructively map data points out of the distribution of interest entirely:

y(hτL ∘ · · · ∘ hτ1(x)) ∈ {y(x), y∅}    (2)

where y∅ represents an out-of-distribution null class. Intuitively, this weaker assumption is motivated by the categorical image classification setting, where we observe that transformation operations provided by the user will almost never turn, for example, a plane into a car, but may often turn a plane into an indistinguishable "garbage" image (Fig. 3). We are the first to consider this weaker invariance assumption, which we believe more closely matches various practical data augmentation settings of interest. In Section 5, we also provide empirical evidence that this weaker assumption is useful in binary classification settings and over modalities other than image data. Critically, it also enables us to learn a model of TF sequences using unlabeled data alone.

Figure 3: Our modeling assumption is that transformations may map out of the natural distribution of interest, but will rarely map between classes. As a demonstration, we take images from CIFAR-10 (each row) and randomly search for a transformation sequence that best maps them to a different class (each column), according to a trained discriminative model. The matches rarely resemble the target class but often no longer look like "normal" images at all. 
Note that we consider a fixed set of user-provided TFs, not adversarially selected ones.

Figure 4: Some example transformed images generated using an augmentation generative model trained using our approach. Note that this is not meant as a comparison to Fig. 3.

2.3 Minimizing Null Class Mappings Using Unlabeled Data

Given assumption (2), our objective is to learn a model Gθ which generates sequences of TF indices τ ∈ {1, ..., K}^L with fixed length L, such that the resulting TF sequences hτ1, ..., hτL are not likely to map data points into y∅. Crucially, this does not involve using the class labels of any data points, and so we can use unlabeled data. Our goal is then to minimize the probability of a generated sequence mapping unlabeled data points into the null class, with respect to θ:

J∅ = E_τ∼Gθ E_x∼U [ P(y(hτL ∘ · · · ∘ hτ1(x)) = y∅) ]    (3)

where U is some distribution of unlabeled data.

Generative Adversarial Objective   In order to approximate P(y(hτL ∘ · · · ∘ hτ1(x)) = y∅), we jointly train the generator Gθ and a discriminative model D∅φ using a generative adversarial network (GAN) objective [9], now minimizing with respect to θ and maximizing with respect to φ:

J̃∅ = E_τ∼Gθ E_x∼U [ log(1 − D∅φ(hτL ∘ · · · ∘ hτ1(x))) ] + E_x′∼U [ log(D∅φ(x′)) ]    (4)

As in the standard GAN setup, the training procedure can be viewed as a minimax game in which the discriminator's goal is to assign low values to transformed, out-of-distribution data points and high values to real in-distribution data points, while simultaneously, the generator's goal is to generate transformation sequences which produce data points that are indistinguishable from real data points according to the discriminator. For D∅φ, we use an all-convolutional CNN as in [23]. For further details, see Supplemental Materials.

Diversity Objective   An additional concern is that the model will learn a variety of null transformation sequences (e.g. rotating first left then right repeatedly). Given the potentially large state-space of actions, and the black-box nature of the user-specified TFs, it seems infeasible to hard-code sets of inverse operations to avoid. To mitigate this, we instead consider a second objective term:

J_d = E_τ∼Gθ E_x∼U [ d(hτL ∘ · · · ∘ hτ1(x), x) ]    (5)

where d : X × X → R is some distance function. For d, we evaluated using both distance in the raw input space, and in the feature space learned by the final pre-softmax layer of the discriminator D∅φ. Combining eqns. 4 and 5, our final objective is then J = J̃∅ + α J_d⁻¹, where α > 0 is a hyperparameter. We minimize J with respect to θ and maximize with respect to φ.

2.4 Modeling Transformation Sequences

We now consider two model classes for Gθ:

Independent Model   We first consider a mean field model in which each sequential TF is chosen independently. 
This reduces our task to one of learning K parameters, which we can think of as representing the task-specific "accuracies" or "frequencies" of each TF. For example, we might want to learn that elastic deformations or swirls should only rarely be applied to images in CIFAR-10, but that small rotations can be applied frequently. In particular, a mean field model also provides a simple way of effectively learning stochastic, discretized parameterizations of the TFs. For example, if we have a TF representing five-degree rotations, Rotate5Deg, a marginal value of P_Gθ(Rotate5Deg) = 0.1 could be thought of as roughly equivalent to learning to rotate 0.5L degrees on average.

State-Based Model   There are important cases, however, where the independent representation learned by the mean field model could be overly limited. In many settings, certain TFs may have very different effects depending on which other TFs are applied with them. As an example, certain similar pairs of image transformations might be overly lossy when applied together, such as a blur and a zoom operation, or a brighten and a saturate operation. A mean field model cannot represent such disjunctions. Another scenario where an independent model fails is where the TFs are non-commutative, such as with lossy operators (e.g. image transformations which use aliasing). In both of these cases, modeling the sequences of transformations could be important. Therefore we consider a long short-term memory (LSTM) network as a representative sequence model. The output from each cell of the network is a distribution over the TFs. 
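As an illustrative sketch of the simpler of these two model classes, a mean field generator reduces to K learnable logits from which every TF index in a length-L sequence is drawn independently (the names and structure below are hypothetical, not the released implementation):

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

class MeanFieldGenerator:
    """Mean field G_theta: each of the L TF indices is drawn i.i.d. from a
    single learned categorical distribution over the K TFs."""

    def __init__(self, k: int, seq_len: int):
        self.logits = [0.0] * k   # the K learnable parameters
        self.seq_len = seq_len

    def sample_sequence(self) -> list:
        probs = softmax(self.logits)
        return [random.choices(range(len(probs)), weights=probs)[0]
                for _ in range(self.seq_len)]

gen = MeanFieldGenerator(k=5, seq_len=10)
taus = gen.sample_sequence()  # a length-10 list of TF indices in [0, 5)
```

An LSTM generator would replace the single shared distribution with a per-step distribution conditioned on the previously sampled TF indices.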
The next TF in the sequence is then sampled from this distribution, and is fed as a one-hot vector to the next cell in the network.

3 Learning a Transformation Sequence Model

The core challenge that we now face in learning Gθ is that it generates sequences over TFs which are not necessarily differentiable or deterministic. This constraint is a critical facet of our approach from the usability perspective, as it allows users to easily write TFs as black-box scripts in the language of their choosing, leveraging arbitrary subfunctions, libraries, and methods. In order to work around this constraint, we now describe our model in the syntax of reinforcement learning (RL), which provides a convenient framework and set of approaches for handling computation graphs with non-differentiable or stochastic nodes [27].

Reinforcement Learning Formulation   Let τi be the index of the ith TF applied, and x̃i be the resulting incrementally transformed data point. Then we consider st = (x, x̃1, x̃2, ..., x̃t, τ1, ..., τt) as the state after having applied t of the incremental TFs. Note that we include the incrementally transformed data points x̃1, ..., x̃t in st since the TFs may be stochastic. Each of the model classes considered for Gθ then uses a different state representation ŝ. For the mean field model, the state representation used is ŝ_t^MF = ∅. For the LSTM model, we use ŝ_t^LSTM = LSTM(τt, ŝ_{t−1}^LSTM), the state update operation performed by a standard LSTM cell parameterized by θ.

Policy Gradient with Incremental Rewards   Let ℓt(x, τ) = log(1 − D∅φ(x̃t)) be the cumulative loss for a data point x at step t, with ℓ0(x) = ℓ0(x, τ) ≡ log(1 − D∅φ(x)). Let R(st) = ℓt(x, τ) − ℓ_{t−1}(x, τ) be the incremental reward, representing the difference in discriminator loss at incremental transformation step t. We can now recast the first term of our objective J̃∅ as an expected sum of incremental rewards:

U(θ) ≡ E_τ∼Gθ E_x∼U [ log(1 − D∅φ(hτL ∘ · · · ∘ hτ1(x))) ] = E_τ∼Gθ E_x∼U [ ℓ0(x) + Σ_{t=1}^L R(st) ]    (6)

We omit ℓ0 in practice, equivalent to using the loss of x as a baseline term. Next, let πθ be the stochastic transition policy implicitly defined by Gθ. We compute the recurrent policy gradient [32] of the objective U(θ) as:

∇θ U(θ) = E_τ∼Gθ E_x∼U [ Σ_{t=1}^L R(st) ∇θ log πθ(τt | ŝ_{t−1}) ]    (7)

Following standard practice, we approximate this quantity by sampling batches of n data points and m sampled action sequences per data point. We also use standard techniques of discounting with factor γ ∈ [0, 1] and considering only future rewards [12]. See Supplemental Materials for details.

4 Related Work

We now review related work, both to motivate comparisons in the experiments section and to present complementary lines of work.

Heuristic Data Augmentation   Most state-of-the-art image classification pipelines use some limited form of data augmentation [11, 7]. This generally consists of applying crops, flips, or small affine transformations, in fixed order or at random, and with parameters drawn randomly from hand-tuned ranges. 
In addition, various studies have applied heuristic data augmentation techniques to modalities such as audio [31] and text [20]. As reported in the literature, the selection of these augmentation strategies can have large performance impacts, and thus can require extensive selection and tuning by hand [3, 7] (see Supplemental Materials as well).

Interpolation-Based Techniques   Some techniques have explored generating augmented training sets by interpolating between labeled data points. For example, the well-known SMOTE algorithm applies this basic technique for oversampling in class-imbalanced settings [2], and recent work explores using a similar interpolation approach in a learned feature space [5]. [13] proposes learning a class-conditional model of diffeomorphisms interpolating between nearest-neighbor labeled data points as a way to perform augmentation. We view these approaches as complementary but orthogonal, as our goal is to directly exploit user domain knowledge of class-invariant transformation operations.

Adversarial Data Augmentation   Several lines of recent work have explored techniques which can be viewed as forms of data augmentation that are adversarial with respect to the end classification model. In one set of approaches, transformation operations are selected adaptively from a given set in order to maximize the loss of the end classification model being trained [30, 8]. These procedures make the strong assumption that all of the provided transformations will preserve class labels, or use bespoke models over restricted sets of operations [28]. Another line of recent work has shown that augmentation via small adversarial linear perturbations can act as a regularizer [10, 22]. While complementary, this work does not consider taking advantage of non-local transformations derived from user knowledge of task or domain invariances.

Finally, generative adversarial networks (GANs) [9] have recently made great progress in learning complete data generation models from unlabeled data. These can be used to augment labeled training sets as well. Class-conditional GANs [1, 21] generate artificial data points but require large sets of labeled training data to learn from. Standard unsupervised GANs can be used to generate additional out-of-class data points that can then augment labeled training sets [25, 29]. We compare our proposed approach with these methods empirically in Section 5.

5 Experiments

We experimentally validate the proposed framework by learning augmentation models for several benchmark and real-world data sets, exploring both image recognition and natural language understanding tasks. Our focus is on the performance of end classification models trained on labeled datasets augmented with our approach and others used in practice. We also examine robustness to user misspecification of TFs, and sensitivity to core hyperparameters.

5.1 Datasets and Transformation Functions

Benchmark Image Datasets   We ran experiments on the MNIST [18] and CIFAR-10 [17] datasets, using only a subset of the class labels to train the end classification models and treating the rest as unlabeled data. We used a generic set of TFs for both MNIST and CIFAR-10: small rotations, shears, central swirls, and elastic deformations. We also used morphologic operations for MNIST, and adjustments to hue, saturation, contrast, and brightness for CIFAR-10.

Benchmark Text Dataset   We applied our approach to the Employment relation extraction subtask from the NIST Automatic Content Extraction (ACE) corpus [6], where the goal is to identify mentions of employer-employee relations in news articles. 
Given the standard class imbalance in information extraction tasks like this, we used data augmentation to oversample the minority positive class. The flexibility of our TF representation allowed us to take a straightforward but novel approach to data augmentation in this setting. We constructed a trigram language model using the ACE corpus and Reuters Corpus Volume I [19] from which we can sample a word conditioned on the preceding words. We then used this model as the basis for a set of TFs that select words to swap based on the part-of-speech tag and location relative to entities of interest (see Supplemental Materials for details).

Mammography Tumor-Classification Dataset   To demonstrate the effectiveness of our approach on real-world applications, we also considered the task of classifying benign versus malignant tumors from images in the Digital Database for Screening Mammography (DDSM) dataset [15, 4, 26], which is a class-balanced dataset consisting of 1506 labeled mammograms. In collaboration with domain experts in radiology, we constructed two basic TF sets. The first set consisted of standard image transformation operations subselected so as not to break class-invariance in the mammography setting. For example, brightness operations were excluded for this reason. The second set consisted of both the first set as well as several novel segmentation-based transplantation TFs. Each of these TFs utilized the output of an unsupervised segmentation algorithm to isolate the tumor mass, perform a transformation operation such as rotation or shifting, and then stitch it into a randomly-sampled benign tissue image. See Fig. 1 (right panel) for an illustrative example, and Supplemental Materials for further details.

5.2 End Classifier Performance

We evaluated our approach by using it to augment labeled training sets for the tasks mentioned above, and show that we achieve strong gains over heuristic baselines. 
In particular, for a given set of TFs, we evaluate the performance of mean field (MF) and LSTM generators trained using our approach against two standard data augmentation techniques used in practice. The first (Basic) consists of applying random crops to images, or performing simple minority class duplication for the ACE relation extraction task. The second (Heur.) is the standard heuristic approach of applying random compositions of the given set of transformation operations, the most common technique used in practice [3, 11, 14]. For both our approaches (MF and LSTM) and Heur., we additionally use the same random cropping technique as in the Basic approach. We present these results in Table 1, where we report test set accuracy (or F1 score for ACE), and use a random subsample of the available labeled training data. Additionally, we include an extra row for the DDSM task highlighting the impact of adding domain-specific (DS) TFs – the segmentation-based operations described above – on performance.

In Table 2 we additionally compare to two related generative-adversarial methods, the Categorical GAN (CatGAN) [29], and the semi-supervised GAN (SS-GAN) from [25]. Both of these methods use GAN-based architectures trained on unlabeled data to generate new out-of-class data points with which to augment a labeled training set. Following their protocol for CIFAR-10, we train our generator on the full set of unlabeled data, and our end discriminator on ten disjoint random folds of the labeled training set not including the validation set (i.e. n = 4000 each), averaging the results.

In all settings, we train our TF sequence generator on the full set of unlabeled data. We select a fixed sequence length for each task via an initial calibration experiment (Fig. 5b). We use L = 5 for ACE, L = 7 for DDSM + DS, and L = 10 for all other tasks. 
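At end-model training time, the trained generator is simply a sampler of TF index sequences that is applied to each labeled example; that usage can be sketched roughly as follows (a hypothetical outline with illustrative names, not the released implementation):

```python
def augment_epoch(train_set, sample_sequence, tfs):
    """Transform every labeled training point once per epoch by applying
    a TF index sequence sampled from the trained generator."""
    augmented = []
    for x, y in train_set:
        taus = sample_sequence()  # e.g. a length-L list of TF indices
        for t in taus:
            x = tfs[t](x)         # apply each incremental black-box TF
        augmented.append((x, y))  # label y is preserved by assumption (2)
    return augmented

# Toy usage with two deterministic stand-in TFs on scalar "data points",
# and a fixed "sampler" in place of a trained generator:
tfs = [lambda v: v + 1, lambda v: v * 2]
data = [(1, "pos"), (3, "neg")]
out = augment_epoch(data, lambda: [0, 1], tfs)  # -> [(4, "pos"), (8, "neg")]
```

Because a fresh sequence is sampled for every example in every epoch, the end model sees a stream of diverse transformed views of the same labeled set.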
We note that our findings here mirrored those in the literature, namely that compositions of multiple TFs lead to higher end model accuracies. We selected hyperparameters of the generator via performance on a validation set. We then used the trained generator to transform the entire training set at each epoch of end classification model training. For MNIST and DDSM we use a four-layer all-convolutional CNN, for CIFAR-10 we use a 56-layer ResNet [14], and for ACE we use a bi-directional LSTM. Additionally, we incorporate a basic transformation regularization term as in [24] (see Supplemental Materials), and train for the last ten epochs without applying any transformations as in [11]. In all cases, we use hyperparameters as reported in the literature. For further details of generator and end model training see the Supplemental Materials.

Task        %    None  Basic  Heur.  MF    LSTM
MNIST       1    90.2  95.3   95.9   96.5  96.7
MNIST       10   97.3  98.7   99.0   99.2  99.1
CIFAR-10    10   66.0  73.1   77.5   79.8  81.5
CIFAR-10    100  87.8  91.9   92.3   94.4  94.0
ACE (F1)    100  62.7  59.9   62.8   62.9  64.2
DDSM        10   57.6  58.8   59.3   58.2  61.0
DDSM + DS   10   --    --     53.7   59.9  62.7

Table 1: Test set performance of end models trained on subsamples of the labeled training data (%), not including validation splits, using various data augmentation approaches. None indicates performance with no augmentation. All tasks are measured in accuracy, except ACE which is measured by F1 score.

Model    Acc. (%)
CatGAN   80.42 ± 0.58
SS-GAN   81.37 ± 2.32
LSTM     81.47 ± 0.46

Table 2: Reported end model accuracies, averaged across 10% subsample folds, on CIFAR-10 for comparable GAN methods.

Figure 5: (a) Learned TF frequency parameters for misspecified and normal TFs on MNIST. The mean field model correctly learns to avoid the misspecified TFs. (b) Larger sequence lengths lead to higher end model accuracy on CIFAR-10, while random performs best with shorter sequences, according to a sequence length calibration experiment.

We see that across the applications studied, our approach outperforms the heuristic data augmentation approach most commonly used in practice. Furthermore, the LSTM generator outperforms the simple mean field one in most settings, indicating the value of modeling sequential structure in data augmentation. In particular, we realize significant gains over standard heuristic data augmentation on CIFAR-10, where we are competitive with comparable semi-supervised GAN approaches, but with significantly smaller variance. We also train the same CIFAR-10 end model using the full labeled training dataset, and again see strong relative gains (2.1 pts. in accuracy over heuristic), coming within 2.1 points of the current state-of-the-art [16] using our much simpler end model.

On the ACE and DDSM tasks, we also achieve strong performance gains, showing the ability of our method to productively incorporate more complex transformation operations from domain expert users. In particular, in DDSM we observe that the addition of the segmentation-based TFs causes the heuristic augmentation approach to perform significantly worse, due to a large number of new failure modes resulting from combinations of the segmentation-based TFs – which use gradient-based blending – and the standard TFs such as zoom and rotate. 
In contrast, our LSTM model learns to avoid these destructive subsequences and achieves the highest score, resulting in a 9.0 point boost over the comparable heuristic approach.

Robustness to TF Misspecification   One of the high-level goals of our approach is to enable an easier interface for users by not requiring that the TFs they specify be completely class-preserving. The fact that our approach makes no assumption that the transformation operations are well specified, and still realizes strong empirical performance, is evidence of this robustness. To further illustrate this robustness to misspecified TFs, we train a mean field generator on MNIST using the standard TF set, but with two TFs (shear operations) parameterized so as to map almost all images to the null class. We see in Fig. 5a that the generator learns to avoid applying the misspecified TFs (red lines) almost entirely.

6 Conclusion and Future Work

We presented a method for learning how to parameterize and compose user-provided black-box transformation operations used for data augmentation. Our approach is able to model arbitrary TFs, allowing practitioners to leverage domain knowledge in a flexible and simple manner. By training a generative sequence model over the specified transformation functions using reinforcement learning in a GAN-like framework, we are able to generate realistic transformed data points which are useful for data augmentation. We demonstrated that our method yields strong gains over standard heuristic approaches to data augmentation for a range of applications, modalities, and complex domain-specific transformation functions. There are many possible future directions of research for learning data augmentation strategies in the proposed model, such as conditioning the generator's stochastic policy on a featurized version of the data point being transformed, and generating TF sequences of dynamic length.
More broadly, we are excited about further formalizing data augmentation as a novel form of weak supervision, allowing users to directly encode domain knowledge about invariants into machine learning models.

Acknowledgements   We would like to thank Daniel Selsam, Ioannis Mitliagkas, Christopher De Sa, William Hamilton, and Daniel Rubin for valuable feedback and conversations. We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) SIMPLEX program under No. N66001-15-C-4043, the DARPA D3M program under No. FA8750-17-2-0095, DARPA programs No. FA8750-12-2-0335 and FA8750-13-2-0039, DOE 108845, National Institutes of Health (NIH) U54EB020405, the Office of Naval Research (ONR) under awards No. N000141210041 and No. N000141310129, the Moore Foundation, the Okawa Research Grant, American Family Insurance, Accenture, Toshiba, and Intel. This research was also supported in part by affiliate members and other supporters of the Stanford DAWN project: Intel, Microsoft, Teradata, and VMware. This material is based on research sponsored by DARPA under agreement number FA8750-17-2-0095. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, AFRL, NSF, NIH, ONR, or the U.S. Government.

References

[1] S. Baluja and I. Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.

[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[3] D. C. Ciresan, U. Meier, L. M. Gambardella, and J.
Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition, 2010.

[4] K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, M. Pringle, L. Tarbox, and F. Prior. The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. Journal of Digital Imaging, 26(6):1045–1057, 2013.

[5] T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.

[6] G. R. Doddington, A. Mitchell, M. A. Przybocki, L. A. Ramshaw, S. Strassel, and R. M. Weischedel. The automatic content extraction (ACE) program: Tasks, data, and evaluation. In LREC, volume 2, page 1, 2004.

[7] A. Dosovitskiy, P. Fischer, J. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. arXiv preprint arXiv:1506.02753, 2015.

[8] A. Fawzi, H. Samulowitz, D. Turaga, and P. Frossard. Adaptive data augmentation for image classification. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3688–3692. IEEE, 2016.

[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[11] B. Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.

[12] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

[13] S. Hauberg, O. Freifeld, A. B. L. Larsen, J. Fisher, and L. Hansen.
Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Artificial Intelligence and Statistics, pages 342–350, 2016.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[15] M. Heath, K. Bowyer, D. Kopans, R. Moore, and W. P. Kegelmeyer. The digital database for screening mammography. In Proceedings of the 5th International Workshop on Digital Mammography, pages 212–218. Medical Physics Publishing, 2000.

[16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.

[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[19] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

[20] X. Lu, B. Zheng, A. Velivelli, and C. Zhai. Enhancing text categorization with semantic-enriched representation and training data augmentation. Journal of the American Medical Informatics Association, 13(5):526–535, 2006.

[21] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[22] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.

[23] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[24] M. Sajjadi, M. Javanmardi, and T. Tasdizen.
Regularization with stochastic transformations and perturbations for deep semi-supervised learning. arXiv preprint arXiv:1606.04586, 2016.

[25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.

[26] R. Sawyer Lee, F. Gimenez, A. Hoogi, and D. Rubin. Curated breast imaging subset of DDSM. In The Cancer Imaging Archive, 2016.

[27] J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.

[28] L. Sixt, B. Wild, and T. Landgraf. RenderGAN: Generating realistic labeled data. arXiv preprint arXiv:1611.01331, 2016.

[29] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

[30] C. H. Teo, A. Globerson, S. T. Roweis, and A. J. Smola. Convex learning with invariances. In Advances in Neural Information Processing Systems, pages 1489–1496, 2008.

[31] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji. Improving music source separation based on deep neural networks through data augmentation and network blending. Submitted to ICASSP, 2017.

[32] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber. Recurrent policy gradients.
Logic Journal of the IGPL, 18(5):620–634, 2010.