{"title": "Learning Data Manipulation for Augmentation and Weighting", "book": "Advances in Neural Information Processing Systems", "page_first": 15764, "page_last": 15775, "abstract": "Manipulating data, such as weighting data examples or augmenting with new instances, has been increasingly used to improve model training. Previous work has studied various rule- or learning-based approaches designed for specific types of data manipulation. In this work, we propose a new method that supports learning different manipulation schemes with the same gradient-based algorithm. Our approach builds upon a recent connection of supervised learning and reinforcement learning (RL), and adapts an off-the-shelf reward learning algorithm from RL for joint data manipulation learning and model training. Different parameterization of the ``data reward'' function instantiates different manipulation schemes. We showcase data augmentation that learns a text transformation network, and data weighting that dynamically adapts the data sample importance. Experiments show the resulting algorithms significantly improve the image and text classification performance in low data regime and class-imbalance problems.", "full_text": "Learning Data Manipulation for Augmentation and\n\nWeighting\n\nZhiting Hu1,2\u2217, Bowen Tan1\u2217, Ruslan Salakhutdinov1, Tom Mitchell1, Eric P. Xing1,2\n\n{zhitingh,btan2,rsalakhu,tom.mitchell}@cs.cmu.edu, eric.xing@petuum.com\n\n1Carnegie Mellon University, 2Petuum Inc.\n\nAbstract\n\nManipulating data, such as weighting data examples or augmenting with new\ninstances, has been increasingly used to improve model training. Previous work\nhas studied various rule- or learning-based approaches designed for speci\ufb01c types\nof data manipulation. In this work, we propose a new method that supports learning\ndifferent manipulation schemes with the same gradient-based algorithm. 
Our approach builds upon a recent connection between supervised learning and reinforcement learning (RL), and adapts an off-the-shelf reward learning algorithm from RL for joint data manipulation learning and model training. Different parameterizations of the “data reward” function instantiate different manipulation schemes. We showcase data augmentation that learns a text transformation network, and data weighting that dynamically adapts the data sample importance. Experiments show the resulting algorithms significantly improve image and text classification performance in the low-data regime and on class-imbalance problems.

1 Introduction

The performance of machine learning models often crucially depends on the amount and quality of the data used for training. It has become increasingly common to manipulate data to improve learning, especially in the low-data regime or in the presence of low-quality datasets (e.g., with imbalanced labels). For example, data augmentation applies label-preserving transformations to original data points to expand the data size; data weighting assigns an importance weight to each instance to adapt its effect on learning; and data synthesis generates entire artificial examples. Different types of manipulation can be suitable for different application settings.
Common data manipulation methods are usually designed manually, e.g., augmenting by flipping an image or replacing a word with synonyms, and weighting by inverse class frequency or loss values [10, 32]. Recent work has studied automated approaches, such as learning the composition of augmentation operators with reinforcement learning [38, 5], deriving sample weights adaptively from a validation set via meta-learning [39], or learning a weighting network by inducing a curriculum [21]. These learning-based approaches have alleviated the engineering burden and produced impressive results.
However, these algorithms are usually designed specifically for certain types of manipulation (e.g., either augmentation or weighting) and thus have limited application scope in practice.
In this work, we propose a new approach that enables learning for different manipulation schemes with the same single algorithm. Our approach draws inspiration from the recent work [46] that shows an equivalence between the data in supervised learning and the reward function in reinforcement learning. We thus adapt an off-the-shelf reward learning algorithm [52] to the supervised setting for automated data manipulation. The marriage of the two paradigms results in a simple yet general algorithm, where various manipulation schemes are reduced to different parameterizations of the data reward. Free parameters of the manipulation are learned jointly with the target model through efficient gradient descent on validation examples. We demonstrate instantiations of the approach for automatically fine-tuning an augmentation network and learning data weights, respectively.
We conduct extensive experiments on text and image classification in challenging situations of very limited data and imbalanced labels. Both augmentation and weighting by our approach significantly improve over strong base models, even though the models are initialized with large-scale pretrained networks such as BERT [7] for text and ResNet [14] for images. Our approach, besides its generality, also outperforms a variety of dedicated rule- and learning-based methods for either augmentation or weighting, respectively.

∗ Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Lastly, we observe that the two types of manipulation tend to excel in different contexts: augmentation shows superiority over weighting when only a small amount of data is available, while weighting is better at addressing class-imbalance problems.
The way we derive the manipulation algorithm represents a general means of problem solving through algorithm extrapolation between learning paradigms, which we discuss more in section 6.

2 Related Work

Rich types of data manipulation have been increasingly used in modern machine learning pipelines. Previous studies have each typically focused on a particular manipulation type. Data augmentation, which perturbs examples without changing the labels, is widely used especially in the vision [44, 26] and speech [24, 36] domains. Common heuristic-based methods on images include cropping, mirroring, rotation [26], and so forth. Recent work has developed automated augmentation approaches [5, 38, 28, 37, 47]. Xie et al. [50] additionally use large-scale unlabeled data. Cubuk et al. [5] and Ratner et al. [38] learn to induce the composition of data transformation operators. Instead of treating data augmentation as a policy in reinforcement learning [5], we formulate manipulation as a reward function and use efficient stochastic gradient descent to learn the manipulation parameters. Text data augmentation has also achieved impressive success, such as contextual augmentation [25, 49], back-translation [42], and manual approaches [48, 2]. In addition to perturbing the input text as in classification tasks, text generation problems expose opportunities to also add noise to the output text, e.g., [35, 51]. Recent work [46] shows that output noising in sequence generation can be treated as an intermediate approach in between supervised learning and reinforcement learning, and develops a new sequence learning algorithm that interpolates along the spectrum of existing algorithms.
We instantiate our approach for text contextual augmentation as in [25, 49], but enhance the previous work by additionally fine-tuning the augmentation network jointly with the target model.
Data weighting has been used in various algorithms, such as AdaBoost [10], self-paced learning [27], hard-example mining [43], and others [4, 22]. These algorithms largely define sample weights based on training loss. Recent work [21, 8] learns a separate network to predict sample weights. Of particular relevance to our work is [39], which induces sample weights using a validation set. The data weighting mechanism instantiated by our framework has a key difference in that sample weights are treated as parameters that are updated iteratively, instead of re-estimated from scratch at each step. We show improved performance of our approach. Besides, our data manipulation approach is derived from a different perspective of reward learning, instead of meta-learning as in [39].
Another popular type of data manipulation involves data synthesis, which creates entire artificial samples from scratch. GAN-based approaches have achieved impressive results for synthesizing conditional image data [3, 34]. In the text domain, controllable text generation [17] presents a way of co-training the data generator and classifier in a cyclic manner within a joint VAE [23] and wake-sleep [15] framework. It is interesting to explore instantiations of the present approach for adaptive data synthesis in the future.

3 Background

We first present the relevant work upon which our automated data manipulation is built. This section also establishes the notation used throughout the paper.
Let x denote the input and y the output. For example, in text classification, x can be a sentence and y is the sentence label. Denote the model of interest as pθ(y|x), where θ denotes the model parameters to be learned.
In the supervised setting, given a set of training examples D = {(x∗, y∗)}, we learn the model by maximizing the data log-likelihood.

Equivalence between Data and Reward  The recent work [46] introduced a unifying perspective of reformulating maximum likelihood supervised learning as a special instance of a policy optimization framework. In this perspective, data examples providing supervision signals are equivalent to a specialized reward function. Since the original framework [46] was derived for sequence generation problems, here we present a slightly adapted formulation for our context of data manipulation.
To connect maximum likelihood supervised learning with policy optimization, consider the model pθ(y|x) as a policy that takes “action” y given the “state” x. Let R(x, y|D) ∈ ℝ denote a reward function, and p(x) be the empirical data distribution, which is known given D. Further assume a variational distribution q(x, y) that factorizes as q(x, y) = p(x)q(y|x). A variational policy optimization objective is then written as:

    L(q, θ) = E_q(x,y)[ R(x, y|D) ] − α KL( q(x, y) ‖ p(x)pθ(y|x) ) + β H(q),    (1)

where KL(·‖·) is the Kullback–Leibler divergence, H(·) is the Shannon entropy, and α, β > 0 are balancing weights. The objective is in the same form as the RL-as-inference formalism of policy optimization [e.g., 6, 29, 1]. Intuitively, the objective maximizes the expected reward under q and enforces the model pθ to stay close to q, with a maximum entropy regularization over q.
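To make the objective concrete, here is a toy numeric check of Eq.(1) on a two-input, two-label problem; a minimal sketch, where the distributions, reward values, and the settings of α and β are our own illustrative choices, not from the paper. It also verifies numerically that the per-x maximizing q (which the EM procedure derives in closed form) scores at least as high as random alternatives:

```python
import math, random

# Toy problem: two inputs, two labels. All names/values are illustrative.
X, Y = (0, 1), (0, 1)
p_x = {0: 0.5, 1: 0.5}                         # empirical input distribution p(x)
p_theta = {(0, 0): 0.8, (0, 1): 0.2,           # model p_theta(y|x)
           (1, 0): 0.3, (1, 1): 0.7}
R = {(0, 0): 1.0, (0, 1): 0.0,                 # a bounded data reward R(x,y|D)
     (1, 0): 0.0, (1, 1): 1.0}
alpha, beta = 0.5, 1.0

def objective(q):
    """L(q, theta) = E_q[R] - alpha*KL(q || p(x)p_theta) + beta*H(q),
    with q(x, y) = p(x) q(y|x); q maps (x, y) -> q(y|x)."""
    val = 0.0
    for x in X:
        for y in Y:
            q_xy = p_x[x] * q[(x, y)]          # joint q(x, y)
            if q_xy > 0.0:
                val += q_xy * R[(x, y)]
                val -= alpha * q_xy * math.log(q[(x, y)] / p_theta[(x, y)])
                val -= beta * q_xy * math.log(q_xy)
    return val

def optimal_q():
    """Per-x maximizer of L over q:
    q(y|x) proportional to exp{(alpha*log p_theta(y|x) + R(x,y)) / (alpha+beta)}."""
    q = {}
    for x in X:
        logits = [(alpha * math.log(p_theta[(x, y)]) + R[(x, y)]) / (alpha + beta)
                  for y in Y]
        z = sum(math.exp(l) for l in logits)
        q.update({(x, y): math.exp(l) / z for y, l in zip(Y, logits)})
    return q

q_star = optimal_q()
best = objective(q_star)
random.seed(0)
for _ in range(200):                            # q_star should beat random q's
    u = random.random()
    q_rand = {(x, y): (u if y == 0 else 1.0 - u) for x in X for y in Y}
    assert best >= objective(q_rand) - 1e-9
```

Note that the KL between the joints reduces to an expected per-x KL because both q and the reference distribution share the same fixed factor p(x).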
The problem is solved with an EM procedure that optimizes q and θ alternatingly:

    E-step:  q′(x, y) = exp{ ( α log p(x)pθ(y|x) + R(x, y|D) ) / (α + β) } / Z,
    M-step:  θ′ = argmax_θ E_{q′(x,y)}[ log pθ(y|x) ],    (2)

where Z is the normalization term. With the established framework, it is easy to show that the above optimization procedure reduces to maximum likelihood learning by taking α → 0, β = 1, and the reward function:

    Rδ(x, y|D) =  1    if (x, y) ∈ D,
                  −∞   otherwise.    (3)

That is, a sample (x, y) receives a unit reward only when it matches a training example in the dataset, while the reward is negative infinity in all other cases. To make the equivalence to maximum likelihood learning clearer, note that the above M-step now reduces to

    θ′ = argmax_θ E_{p(x) exp{Rδ}/Z}[ log pθ(y|x) ],    (4)

where the joint distribution p(x) exp{Rδ}/Z equals the empirical data distribution, which means the M-step is in fact maximizing the data log-likelihood of the model pθ.

Gradient-based Reward Learning  There is a rich line of research on learning the reward in reinforcement learning. Of particular interest to this work is [52], which learns a parametric intrinsic reward that additively transforms the original task reward (a.k.a. the extrinsic reward) to improve the policy optimization. For consistency of notation with the above, formally, let pθ(y|x) be a policy where y is an action and x is a state. Let R^in_φ be the intrinsic reward with parameters φ.
In each iteration, the policy parameter θ is updated to maximize the joint rewards, through:

    θ′ = θ + γ ∇θ L^{ex+in}(θ, φ),    (5)

where L^{ex+in} is the expectation of the sum of the extrinsic and intrinsic rewards, and γ is the step size. The equation shows that θ′ depends on φ, thus we can write θ′ = θ′(φ).
The next step is to optimize the intrinsic reward parameters φ. Recall that the ultimate measure of the performance of a policy is the value of the extrinsic reward it achieves. Therefore, a good intrinsic reward is supposed to, when the policy is trained with it, increase the eventual extrinsic reward. The update to φ is then written as:

    φ′ = φ + γ ∇φ L^{ex}(θ′(φ)).    (6)

That is, we want the expected extrinsic reward L^{ex}(θ′) of the new policy θ′ to be maximized. Since θ′ is a function of φ, we can directly backpropagate the gradient through θ′ to φ.

Algorithm 1  Joint Learning of Model and Data Manipulation
Input: the target model pθ(y|x); the data manipulation function Rφ(x, y|D); training set D; validation set Dv
1: Initialize model parameter θ and manipulation parameter φ
2: repeat
3:   Optimize θ on D enriched with data manipulation, through Eq.(7)
4:   Optimize φ by maximizing the data log-likelihood on Dv, through Eq.(8)
5: until convergence
Output: learned model pθ∗(y|x) and manipulation Rφ∗(x, y|D)

Figure 1: Algorithm computation. Blue arrows denote learning the model θ. Red arrows denote learning the manipulation φ. Solid arrows denote the forward pass.
Dashed arrows denote the backward pass and parameter updates.

4 Learning Data Manipulation

4.1 Method

Parameterizing Data Manipulation  We now develop our approach of learning data manipulation, through a novel marriage of supervised learning and the above reward learning. Specifically, from the policy optimization perspective, due to the δ-function reward (Eq.3), standard maximum likelihood learning is restricted to using only the exact training examples D, in a uniform way. A natural idea for enabling data manipulation is to relax the strong restrictions of the δ-function reward and instead use a relaxed reward Rφ(x, y|D) with parameters φ. The relaxed reward can be parameterized in various ways, resulting in different types of manipulation. For example, when a sample (x, y) matches a data instance, instead of returning a constant 1 as Rδ does, the new Rφ can return varying reward values depending on the matched instance, resulting in a data weighting scheme. Alternatively, Rφ can return a valid reward even when x matches a data example only in part, or when (x, y) is an entirely new sample not in D, which in effect yields data augmentation and data synthesis, respectively; in these cases φ is either a data transformer or a generator. In the next section, we demonstrate two particular parameterizations for data augmentation and weighting, respectively.
We have thus shown that the diverse types of manipulation all boil down to a parameterized data reward Rφ. Such a concise, uniform formulation of data manipulation has the advantage that, once we devise a method of learning the manipulation parameters φ, the resulting algorithm can directly be applied to automate any manipulation type.
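As a toy illustration of this unification (a minimal sketch; the dataset and all function names are our own, not the paper's), the δ-reward of Eq.(3) and a weighting reward can be written as two parameterizations of the same interface. With all weights equal, the weighting scheme collapses back to the uniform treatment of training examples that the δ-reward induces:

```python
import math

# Hypothetical toy dataset of (x, y) pairs.
D = [(0, 1), (1, 0), (2, 1)]

def reward_delta(x, y, data):
    """Delta reward: unit reward iff (x, y) is a training example, else -inf."""
    return 1.0 if (x, y) in data else -math.inf

def reward_weighting(x, y, data, phi):
    """Weighting reward: a matched example i earns its own weight phi[i]."""
    return phi[data.index((x, y))] if (x, y) in data else -math.inf

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

assert reward_delta(0, 1, D) == 1.0
assert reward_delta(5, 5, D) == -math.inf

# Equal weights: the effective data distribution is uniform, as with the
# delta reward.
w = softmax([reward_weighting(x, y, D, [0.7, 0.7, 0.7]) for (x, y) in D])
assert all(abs(wi - 1.0 / len(D)) < 1e-9 for wi in w)

# Unequal weights tilt the effective distribution toward example 2.
w = softmax([reward_weighting(x, y, D, [0.0, 0.0, 2.0]) for (x, y) in D])
assert w[2] > w[0] and abs(w[0] - w[1]) < 1e-12
```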
We present a learning algorithm next.

Learning Manipulation Parameters  To learn the parameters φ in the manipulation reward Rφ(x, y|D), we could in principle adopt any off-the-shelf reward learning algorithm in the literature. In this work, we draw inspiration from the above gradient-based reward learning (section 3) due to its simplicity and efficiency. Briefly, the objective of φ is to maximize the ultimate measure of the performance of the model pθ(y|x), which, in the context of supervised learning, is the model performance on a held-out validation set.
The algorithm optimizes θ and φ alternatingly, corresponding to Eq.(5) and Eq.(6), respectively. More concretely, in each iteration, we first update the model parameters θ in analogy to Eq.(5), which optimizes the intrinsic-reward-enriched objective. Here, we optimize the log-likelihood of the training set enriched with data manipulation. That is, we replace Rδ with Rφ in Eq.(4), and obtain the augmented M-step:

    θ′ = argmax_θ E_{p(x) exp{Rφ(x,y|D)}/Z}[ log pθ(y|x) ].    (7)

By noticing that the new θ′ depends on φ, we can write θ′ as a function of φ, namely, θ′ = θ′(φ). The practical implementation of the above update depends on the actual parameterization of the manipulation Rφ, which we discuss in more detail in the next section.
The next step is to optimize φ in terms of the model validation performance, in analogy to Eq.(6). Formally, let Dv be the validation set of data examples.
The update is then:

    φ′ = argmax_φ E_{p(x) exp{Rδ(x,y|Dv)}/Z}[ log pθ′(y|x) ]
       = argmax_φ E_{(x,y)∼Dv}[ log pθ′(y|x) ],    (8)

where, since θ′ is a function of φ, the gradient is backpropagated to φ through θ′(φ). Taking data weighting as an example, where φ contains the training sample weights (more details in section 4.2), the update optimizes the weights of the training samples so that the model performs best on the validation set.
The resulting algorithm is summarized in Algorithm 1, and Figure 1 illustrates the computation flow. Learning the manipulation parameters effectively uses a held-out validation set. We show in our experiments that a very small set of validation examples (e.g., 2 labels per class) is enough to significantly improve the model performance in the low-data regime.
It is worth noting that some previous work has also leveraged validation examples, such as learning data augmentation with policy gradient [5] or inducing data weights with meta-learning [39]. Our approach is instead inspired by a distinct paradigm of (intrinsic) reward learning. In contrast to [5], which treats data augmentation as a policy, we formulate manipulation as a reward function and enable efficient stochastic gradient updates. Our approach is also more broadly applicable to diverse data manipulation types than [39, 5].

4.2 Instantiations: Augmentation & Weighting

As a case study, we show two parameterizations of Rφ which instantiate distinct data manipulation schemes. The first example learns augmentation for text data, a domain that has been less studied in the literature compared to vision and speech [25, 12].
The second instance focuses on automated data weighting, which is applicable to any data domain.

Fine-tuning Text Augmentation
The recent work [25, 49] developed a novel contextual augmentation approach for text data, in which a powerful pretrained language model (LM), such as BERT [7], is used to generate substitutions of words in a sentence. Specifically, given an observed sentence x∗, the method first randomly masks out a few words. The masked sentence is then fed to BERT, which fills the masked positions with new words. To preserve the original sentence class, the BERT LM is retrofitted as a label-conditional model and trained on the task training examples. The resulting model is then fixed and used to augment data during the training of the target model. We denote the augmentation distribution as gφ0(x|x∗, y∗), where φ0 denotes the fixed BERT LM parameters.
The above process has two drawbacks. First, the LM is fixed after fitting to the task data. In the subsequent phase of training the target model, the LM augments data without knowing the state of the target model, which can lead to sub-optimal results. Second, in cases where the task dataset is small, the LM can be insufficiently trained for preserving the labels faithfully, resulting in noisy augmented samples.
To address these difficulties, it is beneficial to apply the proposed learning data manipulation algorithm to additionally fine-tune the LM jointly with target model training. As discussed in section 4, this reduces to properly parameterizing the data reward function:

    R^aug_φ(x, y|D) =  1    if x ∼ gφ(x|x∗, y), (x∗, y) ∈ D,
                       −∞   otherwise.    (9)

That is, a sample (x, y) receives a unit reward when y is the true label and x is the augmented sample produced by the LM (instead of the exact original data x∗).
Plugging the reward into Eq.(7), we obtain the data-augmented update for the model parameters:

    θ′ = argmax_θ E_{x∼gφ(x|x∗,y), (x∗,y)∼D}[ log pθ(y|x) ].    (10)

That is, we pick an example from the training set and use the LM to create augmented samples, which are then used to update the target model. Regarding the update of the augmentation parameters φ (Eq.8), since text samples are discrete, to enable efficient gradient propagation through θ′ to φ, we use a Gumbel-softmax approximation [20] to x when sampling substitution words from the LM.

Learning Data Weights
We now demonstrate the instantiation of data weighting. We aim to assign an importance weight to each training example to adapt its effect on model training. We automate the process by learning the data weights. This is achieved by parameterizing Rφ as:

    R^w_φ(x, y|D) =  φi    if (x, y) = (x∗i, y∗i) ∈ D,
                     −∞    otherwise,    (11)

where φi ∈ ℝ is the weight associated with the ith example. Plugging R^w_φ into Eq.(7), we obtain the weighted update for the model θ:

    θ′ = argmax_θ E_{(x∗i,y∗i)∈D, i∼softmax(φ)}[ log pθ(y∗i|x∗i) ]
       = argmax_θ E_{(x∗i,y∗i)∼D}[ softmax(φ)i · log pθ(y∗i|x∗i) ].    (12)

In practice, when minibatch stochastic optimization is used, we approximate the weighted sampling by taking the softmax over the weights of only the minibatch examples. The data weights φ are updated with Eq.(8). It is worth noting that the previous work [39] similarly derives data weights based on their gradient directions on a validation set.
Our algorithm differs in that the data weights are\nparameters maintained and updated throughout the training, instead of re-estimated from scratch in\neach iteration. Experiments show the parametric treatment achieves superior performance in various\nsettings. There are alternative parameterizations of R\u03c6 other than Eq.(11). For example, replacing \u03c6i\nin Eq.(11) with log \u03c6i in effect changes the softmax normalization in Eq.(12) to linear normalization,\nwhich is used in [39].\n\n5 Experiments\n\nWe empirically validate the proposed data manipulation approach through extensive experiments on\nlearning augmentation and weighting. We study both text and image classi\ufb01cation, in two dif\ufb01cult\nsettings of low data regime and imbalanced labels1.\n\n5.1 Experimental Setup\n\nBase Models. We choose strong pretrained networks as our base models for both text and image\nclassi\ufb01cation. Speci\ufb01cally, on text data, we use the BERT (base, uncased) model [7]; while on\nimage data, we use ResNet-34 [14] pretrained on ImageNet. We show that, even with the large-\nscale pretraining, data manipulation can still be very helpful to boost the model performance on\ndownstream tasks. Since our approach uses validation sets for manipulation parameter learning, for a\nfair comparison with the base model, we train the base model in two ways. The \ufb01rst is to train the\nmodel on the training sets as usual and select the best step using the validation sets; the second is to\ntrain on the merged training and validation sets for a \ufb01xed number of steps. The step number is set to\nthe average number of steps selected in the \ufb01rst method. We report the results of both methods.\nComparison Methods. 
We compare our approach with a variety of previous methods that were designed for specific manipulation schemes: (1) For text data augmentation, we compare with the latest model-based augmentation [49], which uses a fixed conditional BERT language model for word substitution (section 4.2). As with the base models, we also tried fitting the augmentation model to both the training data and the joint training-validation data, and did not observe a significant difference. Following [49], we also study a conventional approach that replaces words with their synonyms using WordNet [33]. (2) For data weighting, we compare with the state-of-the-art approach [39] that dynamically re-estimates sample weights in each iteration based on the validation set gradient directions. We follow [39] and also evaluate the commonly-used proportion method that weights data by inverse class frequency.

Training  For both the BERT classifier and the augmentation model (which is also based on BERT), we use Adam optimization with an initial learning rate of 4e-5. For ResNets, we use SGD optimization with a learning rate of 1e-3. For text data augmentation, we augment each minibatch by generating two or three samples for each data point (each with 1, 2, or 3 substitutions), and use both the samples and the original data to train the model. For data weighting, to avoid exploding values, we update the weight of each data point in a minibatch by decaying the previous weight value with a factor of 0.1 and then adding the gradient. All experiments were implemented with PyTorch (pytorch.org) and were performed on a Linux machine with 4 GTX 1080Ti GPUs and 64GB RAM. All reported results are averaged over 15 runs ± one standard deviation.

1 Code available at https://github.com/tanyuqian/learning-data-manipulation

Model                                    | SST-5 (40+2)  | IMDB (40+5)   | TREC (40+5)
Base model: BERT [7]                     | 33.32 ± 4.04  | 63.55 ± 5.35  | 88.25 ± 2.81
Base model + val-data                    | 35.86 ± 3.03  | 63.65 ± 3.32  | 88.42 ± 4.90
Augment: Synonym                         | 32.45 ± 4.59  | 62.68 ± 3.94  | 88.26 ± 2.76
Augment: Fixed augmentation [49]         | 34.84 ± 2.76  | 63.65 ± 3.21  | 88.28 ± 4.50
Augment: Ours (fine-tuned augmentation)  | 37.03 ± 2.05  | 65.62 ± 3.32  | 89.15 ± 2.41
Weight: Ren et al. [39]                  | 36.09 ± 2.26  | 63.01 ± 3.33  | 88.60 ± 2.85
Weight: Ours                             | 36.51 ± 2.54  | 64.78 ± 2.72  | 89.01 ± 2.39

Table 1: Accuracy of Data Manipulation on Text Classification. All results are averaged over 15 runs ± one standard deviation. The numbers in parentheses next to the dataset names indicate the size of the datasets. For example, (40+2) denotes 40 training instances and 2 validation instances per class.

Model                  | Pretrained    | Not-Pretrained
Base model: ResNet-34  | 37.69 ± 3.03  | 22.98 ± 2.81
Base model + val-data  | 38.09 ± 1.87  | 23.42 ± 1.47
Ren et al. [39]        | 38.02 ± 2.14  | 23.44 ± 1.63
Ours                   | 38.95 ± 2.03  | 24.92 ± 1.57

Table 2: Accuracy of Data Weighting on Image Classification. The small subset of CIFAR10 used here has 40 training instances and 2 validation instances for each class. The “Pretrained” column reports results obtained by initializing the ResNet-34 [14] base model with ImageNet-pretrained weights; “Not-Pretrained” denotes a randomly initialized base model. Since every class has the same number of examples, proportion-based weighting degenerates to base-model training and is thus omitted here.

Figure 2: Words predicted with the highest probabilities by the augmentation LM. Two tokens, “striking” and “grey”, are masked for substitution. The boxes in respective colors list the predicted words after training epochs 1 and 3, respectively. E.g., “stunning” is the most probable substitution for “striking” in epoch 1.

5.2 Low Data Regime

We study the problem where only very few labeled examples are available for each class. Both our augmentation and our weighting boost the base model performance and are superior to the respective comparison methods. We also observe that augmentation performs better than weighting in the low-data setting.

Setup  For text classification, we use the popular benchmark datasets, including SST-5 for 5-class sentence sentiment [45], IMDB for binary movie review sentiment [31], and TREC for 6-class question types [30]. We subsample a small training set for each task by randomly picking 40 instances per class. We further create small validation sets, i.e., 2 instances per class for SST-5, and 5 instances per class for IMDB and TREC, respectively. The reason we use slightly more validation examples on IMDB and TREC is that the model can easily achieve 100% validation accuracy if the validation sets are too small. Thus, the SST-5 task has 210 labeled examples in total, while IMDB has 90 labels and TREC has 270. Such extremely small datasets pose significant challenges for learning deep neural networks. Since the manipulation parameters are trained using the small validation sets, to avoid possible overfitting we restrict the training to a small number (e.g., 5 or 10) of epochs.
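The per-example weight update described under “Training” above (decay the previous weight by a factor of 0.1, then add the gradient) can be sketched as follows; a minimal sketch, where the function and variable names, and the gradient values, are our own illustration:

```python
# Each data weight is decayed by a factor of 0.1 and the current gradient
# estimate for that weight is then added, as described under "Training".
DECAY = 0.1

def update_weights(phi, grads):
    """phi, grads: dicts mapping example index -> weight / gradient estimate."""
    return {i: DECAY * phi[i] + grads[i] for i in phi}

phi = {0: 1.0, 1: -0.5, 2: 0.0}
grads = {0: 0.2, 1: 0.3, 2: -0.1}   # placeholder gradient values
phi = update_weights(phi, grads)
assert abs(phi[0] - 0.3) < 1e-9     # 0.1 * 1.0 + 0.2
assert abs(phi[1] - 0.25) < 1e-9    # 0.1 * (-0.5) + 0.3
assert abs(phi[2] + 0.1) < 1e-9     # 0.1 * 0.0 - 0.1
```

The decay keeps the running weights bounded while still letting the validation gradient accumulate over iterations.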
For image classification, we similarly create a small subset of the CIFAR10 data, which includes 40 instances per class for training and 2 instances per class for validation.

[Figure 2 here. The masked example sentence reads: “Although visually striking and slickly staged, it's also cold, grey, antiseptic and emotionally desiccated.” Top predicted substitutions shown include: stunning, bland, fantastic, dazzling, lively, sharp, charming, heroism, demanding, revealing, taboo, dark, negative, misleading, messy, bitter, goofy, slow, trivial, dry.]

Model                  | 20 : 1000     | 50 : 1000     | 100 : 1000
Base model: BERT [7]   | 54.91 ± 5.98  | 67.73 ± 9.20  | 75.04 ± 4.51
Base model + val-data  | 52.58 ± 4.58  | 55.90 ± 4.18  | 68.21 ± 5.28
Proportion             | 57.42 ± 7.91  | 71.14 ± 6.71  | 76.14 ± 5.80
Ren et al. [39]        | 74.61 ± 3.54  | 76.89 ± 5.07  | 80.73 ± 2.19
Ours                   | 75.08 ± 4.98  | 79.35 ± 2.59  | 81.82 ± 1.88

Table 3: Accuracy of Data Weighting on Imbalanced SST-2. The first row shows the number of training examples in each of the two classes.

Model                    | 20 : 1000     | 50 : 1000     | 100 : 1000
Base model: ResNet [14]  | 72.20 ± 4.70  | 81.65 ± 2.93  | 86.42 ± 3.15
Base model + val-data    | 64.66 ± 4.81  | 69.51 ± 2.90  | 79.38 ± 2.92
Proportion               | 72.29 ± 5.67  | 81.49 ± 3.83  | 84.26 ± 4.58
Ren et al. [39]          | 74.35 ± 6.37  | 82.25 ± 2.08  | 86.54 ± 2.69
Ours                     | 75.32 ± 6.36  | 83.11 ± 2.08  | 86.99 ± 3.47

Table 4: Accuracy of Data Weighting on Imbalanced CIFAR10.
The \ufb01rst row shows the number of\ntraining examples in each of the two classes.\n\nResults Table 1 shows the manipulation results on text classi\ufb01cation. For data augmentation, our\napproach signi\ufb01cantly improves over the base model on all the three datasets. Besides, compared to\nboth the conventional synonym substitution and the approach that keeps the augmentation network\n\ufb01xed, our adaptive method that \ufb01ne-tunes the augmentation network jointly with model training\nachieves superior results. Indeed, the heuristic-based synonym approach can sometimes harm the\nmodel performance (e.g., SST-5 and IMDB), as also observed in previous work [49, 25]. This\ncan be because the heuristic rules do not \ufb01t the task or datasets well. In contrast, learning-based\naugmentation has the advantage of adaptively generating useful samples to improve model training.\nTable 1 also shows the data weighting results. Our weight learning consistently improves over the base\nmodel and the latest weighting method [39]. In particular, instead of re-estimating sample weights\nfrom scratch in each iteration [39], our approach treats the weights as manipulation parameters\nmaintained throughout the training. We speculate that the parametric treatment can adapt weights\nmore smoothly and provide historical information, which is bene\ufb01cial in the small-data context.\nIt is interesting to see from Table 1 that our augmentation method consistently outperforms the\nweighting method, showing that data augmentation can be a more suitable technique than data\nweighting for manipulating small-size data. Our approach provides the generality to instantiate\ndiverse manipulation types and learn with the same single procedure.\nTo investigate the augmentation model and how the \ufb01ne-tuning affects the augmentation results, we\nshow in Figure 2 the top-5 most probable word substitutions predicted by the augmentation model\nfor two masked tokens, respectively. 
Comparing the results of epochs 1 and 3, we can see the augmentation model evolves and dynamically adjusts its behavior as training proceeds. Through fine-tuning, the model seems to make substitutions that are more coherent with the conditioning label and more relevant to the original words (e.g., replacing the word “striking” with “bland” in epoch 1 vs. “charming” in epoch 3).

Table 2 shows the data weighting results on image classification. We evaluate two settings, with the ResNet-34 base model initialized either randomly or with pretrained weights. Our data weighting consistently improves over the base model and [39] regardless of the initialization.

5.3 Imbalanced Labels

We next study a different problem setting where the training data of different classes are imbalanced. We show that the data weighting approach greatly improves classification performance. We also observe that the LM data augmentation approach, which performs well in the low-data setting, fails on the class-imbalance problems.

Setup Though the methods are broadly applicable to multi-way classification problems, here we only study binary classification tasks for simplicity. For text classification, we use the SST-2 sentiment analysis benchmark [45]; for images, we select classes 1 and 2 from CIFAR10 for binary classification. We use the same processing on both datasets to build the class-imbalance setting. Specifically, we randomly select 1,000 training instances of class 2, and vary the number of class-1 instances in {20, 50, 100}. For each dataset, we use 10 validation examples per class. Trained models are evaluated on the full binary-class test set.

Results Table 3 shows the classification results on SST-2 with varying imbalance ratios. We can see that our data weighting performs best across all settings.
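As a cartoon of the weighting scheme evaluated here, the sketch below maintains per-example weight logits as persistent parameters (softmax-normalized over the batch) and updates both the model and the weights at each step. The weight-update signal is a toy stand-in (down-weighting currently high-loss examples) for the paper's validation-driven reward, so treat the details as assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, c = 8, 5, 2
x = torch.randn(n, d)
y = torch.randint(0, c, (n,))

model = torch.nn.Linear(d, c)
weight_logits = torch.zeros(n, requires_grad=True)  # persistent manipulation parameters
opt_model = torch.optim.SGD(model.parameters(), lr=0.1)
opt_w = torch.optim.SGD([weight_logits], lr=0.5)

for step in range(20):
    # (1) Model update under the current (detached) data weights.
    w = torch.softmax(weight_logits.detach(), dim=0)
    losses = F.cross_entropy(model(x), y, reduction="none")
    opt_model.zero_grad()
    (w * losses).sum().backward()
    opt_model.step()

    # (2) Weight update: toy stand-in for the validation-based signal,
    # shifting weight away from currently high-loss examples.
    losses = F.cross_entropy(model(x), y, reduction="none").detach()
    w = torch.softmax(weight_logits, dim=0)
    opt_w.zero_grad()
    (w * losses).sum().backward()
    opt_w.step()

weights = torch.softmax(weight_logits, dim=0)
print(weights)  # learned, normalized per-example weights
```

Because the logits persist across iterations, the weights evolve smoothly instead of being re-estimated from scratch each step, matching the contrast with [39] drawn above.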
In particular, the improvement over the base model grows as the data becomes more imbalanced, ranging from around 6 accuracy points at 100:1000 to over 20 accuracy points at 20:1000. Our method is again consistently better than [39], validating that the parametric treatment is beneficial. The proportion-based data weighting provides only limited improvement, showing the advantage of adaptive data weighting. The base model trained on the joint training-validation data for a fixed number of steps fails to perform well, partly due to the lack of a proper mechanism for selecting the number of steps.

Table 4 shows the results on imbalanced CIFAR10 classification. Similarly, our method outperforms the other comparison approaches. In contrast, the fixed proportion-based method sometimes harms performance, as in the 50:1000 and 100:1000 settings.

We also tested the text augmentation LM on the SST-2 imbalanced data. Interestingly, the augmentation tends to hinder model training and yields an accuracy of around 50% (random guess). This is because the augmentation LM is first fit to the imbalanced data, which makes label preservation inaccurate and introduces substantial noise during augmentation. Though a more carefully designed augmentation mechanism could potentially help with imbalanced classification (e.g., augmenting only the rare classes), this observation further shows that different data manipulation schemes have different applicable scopes. Our approach is thus favorable, as the single algorithm can be instantiated to learn different schemes.

6 Discussions: Algorithm Extrapolation between Learning Paradigms

Conclusions. We have developed a new method for learning different data manipulation schemes with the same single algorithm. Different manipulation schemes reduce to different parameterizations of the data reward function. The manipulation parameters are trained jointly with the target model parameters.
We instantiate the algorithm for data augmentation and weighting, and show improved performance over strong base models and previous manipulation methods. We are excited to explore more types of manipulation such as data synthesis, and in particular to study the combination of different manipulation schemes.

The proposed method builds upon the connection between supervised learning and reinforcement learning (RL) [46], through which we extrapolate an off-the-shelf reward learning algorithm from the RL literature to the supervised setting. The way we obtained the manipulation algorithm represents a general means of innovating problem solutions based on unifying formalisms of different learning paradigms. Specifically, a unifying formalism not only offers new understanding of seemingly distinct paradigms, but also allows us to systematically apply solutions to problems in one paradigm to similar problems in another. Previous work along this line has produced fruitful results in other domains. For example, an extended formulation of [46] that connects RL and posterior regularization (PR) [11, 16] has made it possible to similarly export a reward learning algorithm to the context of PR for learning structured knowledge [18]. By establishing a uniform abstraction of GANs [13] and VAEs [23], Hu et al. [19] exchange techniques between the two families and obtain improved generative modeling. Other work in a similar spirit includes [40, 41, 9], among others.

By extrapolating algorithms between paradigms, one can go beyond crafting new algorithms from scratch as in most existing studies, which often requires deep expertise and yields solutions unique to a dedicated context. Instead, innovation becomes easier by importing rich ideas from other paradigms, and is repeatable, as a new algorithm can be methodically extrapolated to multiple different contexts.

References

[1] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller.
Maximum a posteriori policy optimisation. In ICLR, 2018.

[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.

[3] S. Baluja and I. Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.

[4] H.-S. Chang, E. Learned-Miller, and A. McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In NeurIPS, pages 1002–1012, 2017.

[5] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. In CVPR, 2019.

[6] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

[8] Y. Fan, F. Tian, T. Qin, X.-Y. Li, and T.-Y. Liu. Learning to teach. In ICLR, 2018.

[9] C. Finn, P. Christiano, P. Abbeel, and S. Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[11] K. Ganchev, J. Gillenwater, B. Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.

[12] P. K. B. Giridhara, M. Chinmaya, R. K. M. Venkataramana, S. S. Bukhari, and A. Dengel. A study of various text augmentation techniques for relation classification in free text. In ICPRAM, 2019.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.
Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[15] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The “wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.

[16] Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing. Harnessing deep neural networks with logic rules. In ACL, 2016.

[17] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In ICML, 2017.

[18] Z. Hu, Z. Yang, R. Salakhutdinov, X. Liang, L. Qin, H. Dong, and E. Xing. Deep generative models with learnable knowledge constraints. In NIPS, 2018.

[19] Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing. On unifying deep generative models. In ICLR, 2018.

[20] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[21] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.

[22] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. In ICML, 2018.

[23] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[24] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur. Audio augmentation for speech recognition. In INTERSPEECH, 2015.

[25] S. Kobayashi. Contextual augmentation: Data augmentation by words with paradigmatic relations. In NAACL, 2018.

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[27] M. P.
Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NeurIPS, pages 1189–1197, 2010.

[28] J. Lemley, S. Bazrafkan, and P. Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858–5869, 2017.

[29] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

[30] X. Li and D. Roth. Learning question classifiers. In COLING, pages 1–7, 2002.

[31] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In ACL, pages 142–150, 2011.

[32] T. Malisiewicz, A. Gupta, A. A. Efros, et al. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.

[33] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[34] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[35] M. Norouzi, S. Bengio, N. Jaitly, M. Schuster, Y. Wu, D. Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In NIPS, pages 1723–1731, 2016.

[36] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

[37] X. Peng, Z. Tang, F. Yang, R. S. Feris, and D. Metaxas. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In CVPR, 2018.

[38] A. J. Ratner, H. Ehrenberg, Z. Hussain, J.
Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. In NeurIPS, 2017.

[39] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.

[40] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 1999.

[41] R. Samdani, M.-W. Chang, and D. Roth. Unified expectation maximization. In ACL, 2012.

[42] R. Sennrich, B. Haddow, and A. Birch. Improving neural machine translation models with monolingual data. In ACL, 2016.

[43] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, pages 761–769, 2016.

[44] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239–274. Springer, 1998.

[45] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642, 2013.

[46] B. Tan, Z. Hu, Z. Yang, R. Salakhutdinov, and E. Xing. Connecting the dots between MLE and RL for sequence generation. arXiv preprint arXiv:1811.09740, 2018.

[47] T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid. A Bayesian data augmentation approach for learning deep models. In NeurIPS, pages 2797–2806, 2017.

[48] J. W. Wei and K. Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196, 2019.

[49] X. Wu, S. Lv, L. Zang, J. Han, and S. Hu. Conditional BERT contextual augmentation. arXiv preprint arXiv:1812.06705, 2018.

[50] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.

[51] Z. Xie, S.
I. Wang, J. Li, D. Lévy, A. Nie, D. Jurafsky, and A. Y. Ng. Data noising as smoothing in neural network language models. In ICLR, 2017.

[52] Z. Zheng, J. Oh, and S. Singh. On learning intrinsic rewards for policy gradient methods. In NeurIPS, 2018.