{"title": "Learning to Repair Software Vulnerabilities with Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7933, "page_last": 7943, "abstract": "Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.", "full_text": "Learning to Repair Software Vulnerabilities\n\nwith Generative Adversarial Networks\n\nJacob A. Harer1,2, Onur Ozdemir1, Tomo Lazovich3\u2217, Christopher P. Reale1,\n\nRebecca L. Russell1, Louis Y. Kim1, Peter Chin2\n\n1Draper, Cambridge, MA\n\n3Lightmatter, Boston, MA\n\n2Department of Computer Science, Boston University, Boston, MA\n\n{jharer,oozdemir,creale,rrussell,lkim}@draper.com,\n\ntomo@lightmatter.ai, spchin@cs.bu.edu\n\nAbstract\n\nMotivated by the problem of automated repair of software vulnerabilities,\nwe propose an adversarial learning approach that maps from one discrete\nsource domain to another target domain without requiring paired labeled\nexamples or source and target domains to be bijections. We demonstrate\nthat the proposed adversarial learning approach is an e\ufb00ective technique for\nrepairing software vulnerabilities, performing close to seq2seq approaches\nthat require labeled pairs. 
The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.

1 Introduction
Security vulnerabilities in software programs pose serious risks to computer systems. Malicious users can compromise programs through their vulnerabilities to force them to behave in undesirable ways (e.g. crash, expose sensitive user information, etc.). Thousands of such vulnerabilities are reported publicly to the Common Vulnerabilities and Exposures database (CVE) each year, and many more are discovered internally in proprietary code and patched [1, 2]. These vulnerabilities are often the result of errors made by programmers, and, due to the prevalence of open source software and code re-use, can propagate quickly.
In this paper, we address the problem of learning to automatically repair the source code of software containing security vulnerabilities. This problem is analogous to grammatical error correction, in which a grammatically incorrect sentence is translated into a correct one. In our case, bad source code (that contains a vulnerability) takes the place of an incorrect sentence and is repaired into good source code.
Neural Machine Translation (NMT) systems have recently achieved state-of-the-art performance on language translation and correction tasks [3, 4, 5, 6]. These models use an encoder-decoder approach to transform an input sequence x = (x0, x1, ..., xT) into an output sequence y = (y0, y1, ..., yT′), e.g., translating a sequence of words forming a sentence in English to one in German. By far the most common method of training NMT systems is to use labeled pairs of examples to compare the likelihood of network output to a desired version, necessitating a one-to-one mapping between input and desired output data.
This can be difficult to obtain, as in most cases it requires costly hand annotation.
In many sequence-to-sequence (seq2seq) applications, it is much easier to obtain unpaired data, i.e., data from both source and target domains without any matching pairs, since this only requires data to be labeled as either source or target.

*Work done while author was affiliated with Draper.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

For example, in natural language translation it is easy to obtain monolingual corpora in different languages at almost no cost. For source code, automated error detection methods exist, such as static analyzers or machine learning approaches, which can be used to label code as having vulnerabilities or not, but do not provide one-to-one pairing between labeled sets [7, 8].
Our approach to address this problem is adversarial learning with Generative Adversarial Networks (GANs) [9]. This approach allows us to train without paired examples. We employ a traditional NMT model as the generator, and replace the typical negative likelihood loss with the gradient stemming from the loss of an adversarial discriminator. The discriminator is trained to distinguish between NMT-generated outputs and real examples of desired output, and so its loss serves as a proxy for the discrepancy between the generated and real distributions. This problem has three main difficulties. Firstly, sampling from the output of NMT systems, in order to produce discrete outputs, is non-differentiable. We address this problem by using a discriminator which operates directly on the expected (soft) outputs of the NMT system during training, which we thoroughly discuss in Section 3.2. Secondly, adversarial training does not guarantee that the generated code will correspond to the input bad code (i.e. the generator is trained to match distributions, not samples).
To push the generator to produce useful repairs (i.e., to ensure that the generated code is a repaired version of the input bad code), we condition our NMT generator on the input x by incorporating two novel generator loss functions, described in Section 3.3. Thirdly, the domains we consider are not bijective, i.e., a piece of bad code can have more than one repair, and a piece of good code can be broken in more than one way. The regularizers we propose in Section 3.3 still work in this case. We should note that although our motivation is to repair source code, the approach and the techniques proposed in this paper are application-agnostic in that they can be applied to other similar problems, such as correcting grammar errors or converting between negative and positive sentiments (e.g., in online reviews). Additionally, while software vulnerability repair is a harder problem than detection, our proposed repair technique can leverage the same datasets used for detection and yields a much more explainable and useful tool than detection alone.

2 Related Work

2.1 Software Repair
Much research has been done on automatic repair of software. Here we describe previous data-driven approaches (see [10] for a more extensive review of the subject). Two successful recent approaches are those of Le et al. [11] and Long and Rinard [12]. Le et al. mine a history of bug fixes across multiple projects and attempt to reuse common bug fix patterns on newly discovered bugs. Long and Rinard learn and use a probabilistic model to rank potential fixes for defective code. These works, along with the majority of past work in this area, require a set of test cases which is used to rank and validate produced repairs. Devlin et al. [13] avoid the need for test cases by generating repairs with a rule-based method and then ranking them using a neural network. Gupta et al. [14] take this one step further by training a seq2seq model to directly generate repairs for incorrect code.
Hence, the work in [14] most closely resembles ours, but has the major drawback of requiring paired training data.

2.2 GANs
GANs were first introduced by Goodfellow et al. [15] to learn a generative model of natural images. Since then, many variants of GANs have been created and applied to the image domain [16, 17, 18, 19, 20]. GANs have generally focused on images due to the abundance of data and their continuous nature. Applying GANs to discrete data (e.g. text) poses technically challenging issues not present in the continuous case (e.g. propagating gradients through discrete values). One successful approach is that of Yu et al. [21], which treats the output of the discriminator as a reward in a reinforcement learning setting. This allows the sampling of outputs from the generator, since gradients do not need to be passed back through the discriminator. However, since a reward is provided for the entire sequence, gradients computed for the generator do not provide information on which parts of the output sequence the discriminator thinks are incorrect, resulting in long convergence times.
Several other approaches have had success with directly applying an adversarial discriminator to the output of a sequence generator with likelihood output. Zhang et al. [22] replace the traditional GAN loss in the discriminator with a Maximum Mean Discrepancy (MMD) metric in order to stabilize GAN training. Both Press et al. [23] and Rajeswar et al. [24] are able to generate fairly realistic looking sentences of modest length using Wasserstein GAN [17], which is the approach we adopt in this paper.
Work has also been done on how to condition a GAN's generator on an input sequence x instead of a random variable. This can easily be performed when paired data is available, by providing the discriminator with both x and y, thereby formulating the problem as in the conditional approach of Mirza and Osindero [25, 26].
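As a minimal illustration of that conditional setup, the discriminator can score the source and a candidate output as one joined sequence. The token lists and separator token below are purely hypothetical; a real system would operate on embedded lexed code:

```python
def conditional_disc_input(x_tokens, y_tokens, sep="<SEP>"):
    # With paired data, the discriminator can be conditioned on the source x
    # by scoring x and a (real or generated) y as a single joined sequence.
    return x_tokens + [sep] + y_tokens

# A real pair and a generated pair share the same source x; the
# discriminator sees both and learns to tell them apart.
real_input = conditional_disc_input(["int", "a", ";"], ["int", "a", "=", "0", ";"])
```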
This approach, however, is clearly more difficult when pairs are not available. One approach is to enforce conditionality through the use of dual generator pairs which translate between domains in opposite directions. For example, Gomez et al. apply the cycle GAN [27] approach to cipher cracking [28]. They train two generators, one to take raw text and produce ciphered text, and the other to undo the cipher. Having two generators allows Gomez et al. to encrypt raw data using the first generator, then decrypt it with the other, ensuring conditionality by adding a loss function which compares this doubly translated output with the original raw input. Lample et al. [29] adopt a somewhat similar approach for NMT. They translate using two encoder/decoder pairs which convert from a given language to a latent representation and back, respectively. They then use an adversarial loss to ensure that the latent representations are the same between both languages, thus allowing translation by encoding from one language and then decoding into the second. For conditionality they adopt a similar approach to Gomez et al. by fully translating a sentence from one language to another, translating it back, and then comparing the original sentence to the produced double translation.
The approaches of both Gomez et al. and Lample et al. rely on the ability to transform a sentence across domains in both directions. This makes sense in many translation spaces, as there are a finite number of reasonable ways to transform a sentence in one language to a correct one in the other. This allows a network which finds a single mapping from every point in one domain to a single point in the other domain to still cover the majority of translations.
Unfortunately, in a sequence correction task such as our problem, one domain contains all correct sequences, while the other contains everything not in the correct domain. Therefore, the mapping from correct to incorrect is not one-to-one; it is one-to-many. A single mapping discovered by a network would fail to cover the space of all bad functions, thus enforcing conditionality only on the relatively small set of bad functions it covers. Therefore, we propose to enforce conditionality using a self-regularization term on the generator, similar in nature to that used by Shrivastava et al. [30] to generate realistic looking images from simulated ones.

3 Formulation

GANs are generative models originally proposed to generate realistic images, y, from random noise vectors, z [9]. GANs find a mapping G : z → y by framing the learning problem as a two-player minimax game between a generator G(·) and a discriminator D(·), where the generator learns to generate realistic looking data samples by minimizing the performance of a discriminator whose goal is to maximize its own performance on discriminating between generated and real samples.
Our problem in this paper is different from the original GAN problem in that our goal is to find a mapping between two discrete valued domains, namely between a given bad code (or source) domain X and a good code (or target) domain Y, by using unpaired training samples {xi} for i = 1, ..., N and {yi} for i = 1, ..., M, where xi ∈ X and yi ∈ Y.

3.1 Adversarial Loss
The original GAN loss of Goodfellow et al. [9] is expressed as

LGAN(D, G) = Ey∼P(y)[log D(y)] + Ex∼P(x)[log(1 − D(G(x)))]    (1)

where the optimal generator is G* = arg minG maxD LGAN(D, G). It is well known that this loss can be unstable when the supports of the distributions of generated and real samples do not overlap [16]. This causes the discriminator to provide zero gradients.
Further, this standard loss function can lead to mode collapse, where the resulting samples come from a single mode of the real data distribution. To alleviate these problems, Arjovsky et al. [17] proposed the Wasserstein GAN (WGAN) loss, which instead uses the Wasserstein-1 or Earth-Mover's (EM) distance between generated and real data samples in the discriminator. EM distance is relatively straightforward to estimate and leads to the easily computable loss function:

LWGAN(D, G) = Ey∼P(y)[D(y)] − Ex∼P(x)[D(G(x))]    (2)

where the discriminator function D is constrained to be 1-Lipschitz. We use WGAN in our model as it leads to more stable training.

3.2 GANs with Discrete Data
One of the main challenges of adversarial training with discrete sequences is that sampling from the output of NMT systems in order to produce discrete outputs is non-differentiable. The goal of training is to generate samples from the unknown distribution of real sequences PY, which can be factorized as

PY(y) = P(y0) ∏_{t=1}^{T} P(yt | y0, ..., yt−1)    (3)

where each conditional distribution P(yt | y0, ..., yt−1) is estimated (using an RNN generator in our case) with a softmax output

P̂(yt | y0, ..., yt−1) = st := softmax(f(yt−1, ht−1))    (4)

where f(·) and ht denote the generator network and the hidden state of the RNN at time t, respectively. Ideally, we would sample from st to generate a sequence and provide that to the discriminator along with the real data for training, but this sampling process is non-differentiable. Instead, we provide the discriminator with st directly. Since each st is dependent on the previously produced output token and the RNN state, we still need to sample yt−1 from st−1 using arg max to generate st.
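To make this mechanism concrete, one decoding step can be sketched in plain Python. The logits below are hypothetical stand-ins for the RNN output f(yt−1, ht−1); only the softmax/arg-max handling mirrors the scheme above:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_one_hot_step(logits):
    """One decoding step: return the soft one-hot vector st that is fed to
    the discriminator, and the arg-max token fed back into the RNN."""
    s_t = softmax(logits)
    next_token = max(range(len(s_t)), key=lambda i: s_t[i])
    return s_t, next_token

# Toy vocabulary of size 4; the logits stand in for f(y_{t-1}, h_{t-1}).
s_t, next_token = soft_one_hot_step([2.0, 0.5, -1.0, 0.1])
# s_t sums to 1 but is never exactly one-hot: it lies in the interior of
# the probability simplex, while real tokens sit on its vertices.
```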
Note that st can be interpreted as\nthe soft one-hot representation as it corresponds to the expectation of one-hot vectors with\nrespect to the conditional distribution in (4). Although this soft representation alleviates\nthe issue of non-di\ufb00erentiability, it may introduce potential issues with the discriminator\nwhich we discuss next.\nNote that since each generator output st is a probability vector it will almost surely not\nbe a one hot vector. In other words, while every real token, yt, lies on one of the vertices\nof the standard V \u2212 1 dimensional simplex, our generated outputs, st, lie on the interior\nof the simplex. This implies that Pr and Ps have disjoint supports and are perfectly\nseparable in theory. Therefore, there exists a \u2018trivial\u2019 discriminator which looks at each\ntoken independently and discriminates based on whether a sequence consists of one-hot\nvectors or not. Such a discriminator would not provide useful information for training the\ngenerator since it does not pay attention to the sequential dependencies between tokens.\nNevertheless, we conjecture that simple discriminator architectures do not have this problem,\nsince such a \u2018trivial\u2019 discriminator may be hard to realize in practice. This was veri\ufb01ed in\nour experiments where we found that relatively shallow networks, such as those using only a\nsingle convolutional layer, performed better than deeper ones.\nThere is related work in the literature [23, 24, 28] that reported avoiding this \u2018trivial\u2019\ndiscriminator by using the improved Wasserstein GAN (WGAN-GP) loss [31]. However,\nin our implementations, we had more success with the original version of the Wasserstein\nGAN, which uses clipped weights in the discriminator (after both versions had su\ufb03cient\nhyper-parameter tuning). 
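The clipped-weight Wasserstein objective of Eq. (2) can be sketched as follows. The linear critic and two-token vocabulary here are illustrative assumptions, not the 1-D CNN discriminators actually used; the sketch only shows the loss estimate and the clipping step:

```python
def critic_score(weights, sequence):
    # Toy linear critic: dot product of a shared weight vector with each
    # (soft) one-hot token vector, summed over the sequence.
    return sum(w * v for vec in sequence for w, v in zip(weights, vec))

def wgan_critic_objective(weights, real_batch, fake_batch):
    # Empirical estimate of Eq. (2): E[D(y)] - E[D(G(x))].
    d_real = sum(critic_score(weights, s) for s in real_batch) / len(real_batch)
    d_fake = sum(critic_score(weights, s) for s in fake_batch) / len(fake_batch)
    return d_real - d_fake

def clip_weights(weights, c=0.01):
    # Original WGAN recipe: clip every weight to [-c, c] after each update,
    # a crude way of keeping the critic approximately Lipschitz.
    return [max(-c, min(c, w)) for w in weights]

# Hypothetical vocabulary of size 2: real tokens are one-hot vertices,
# generated tokens are soft interior points of the simplex.
real = [[[1.0, 0.0], [1.0, 0.0]]]
fake = [[[0.6, 0.4], [0.6, 0.4]]]
loss = wgan_critic_objective([1.0, 0.0], real, fake)
```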
We believe that this is due to the weight clipping in the original Wasserstein GAN, which forces the discriminator to learn simpler functions, as was shown in the improved WGAN paper [31]. These simpler functions do not allow the discriminator to simply focus on one-hot vectors, and force it to pay attention to sequential dependencies between tokens. To further analyze this point, we provide some visualizations in Figure 1, where we use a paired dataset for analysis purposes.

Figure 1: (a) Wasserstein loss ratios between correctly and incorrectly generated pairs during training. (b-d) Weights of a 1-layer 1D CNN with WGAN loss, a 1-layer 1D CNN with WGAN-GP loss, and a 3-layer 1D CNN with WGAN loss, respectively.

We sample a random set of data pairs, where x is a bad version of y, and compute Wasserstein loss values, LWGAN(D, G) as defined in (2), for two separate cases. For the first loss calculation, we select pairs where the generator G(x) generates correct outputs (G(x) = y), and for the second loss, the generated outputs are incorrect (G(x) ≠ y). We then take the ratio of these two loss values and plot them in Figure 1a for three different discriminator settings, namely i) a 1-layer CNN with WGAN loss; ii) a 1-layer CNN with WGAN-GP loss; and iii) a 3-layer CNN with WGAN loss. A discriminator which only differentiates inputs based on whether they are one-hot vectors or not should have very similar loss values for the two cases, resulting in a loss ratio of ∼1, since in neither case does the generator produce one-hot vectors.
As we observe in Figure 1a, the simpler network architecture (the 1-layer CNN in this case) with the original Wasserstein loss provides better separation, i.e., a better signal, for training the generator. This is further emphasized by Figures 1b-1d, where we show normalized weights of the 1-D convolutional filters (whose kernel size is 11) on the first convolutional layer in each network. Filters for the simplest network in Figure 1b have a low degree of sparsity, implying that they are aggregating data from multiple tokens and taking into account sequential dependencies, whereas the networks in both Figures 1c and 1d have a much higher degree of sparsity, often emphasizing only a single token at a time, which we would expect for discriminators that pay attention to individual tokens to decide whether a given token is one-hot or not.
These observations imply an inherent trade-off. An overly complex discriminator can learn to discriminate based on spurious features, i.e., whether a vector is one-hot or not, which can lead to overfitting. On the other hand, a very simple discriminator will not accurately model the data and, therefore, not provide any useful information to the generator. One needs to treat this trade-off as one would treat a hyperparameter, by tuning the discriminator model on an application-by-application basis.
We should also mention that there are two other approaches proposed in the literature to overcome the issues we discussed above. The first approach is to (linearly and deterministically) embed each one-hot vector into a lower dimensional space [23]. This approach is still vulnerable to the problem of a sufficiently complex discriminator ignoring sequential dependencies, since these embeddings are deterministic. We found this to be the case in practice as well; adding an embedding to the discriminator alone produced no noticeable improvement and still required the use of simple networks.
The second alternative approach is to reparameterize the discrete sampling process via a continuous relaxation using the Gumbel-softmax distribution [32, 33]. This approach, due to the continuous relaxation, still generates (random) outputs via a softmax function, which are therefore similar to our soft one-hot outputs. We experimented with this approach and did not observe any improvements.

3.3 Domain Mapping with Self-Regularization
In the context of source code repair, or more generally sequence correction, we need to constrain our generated samples y to be corrected versions of x. Therefore, we have the following two requirements: (1) correct sequences should remain unchanged when passed through the generator; and (2) repaired sequences should be close to the original corresponding incorrect input sequences.
We explore two regularizers to address these requirements. As our first regularizer, in addition to GAN training, we train our generator as an autoencoder on data sampled from correct sequences. This directly enforces item (1), while indirectly enforcing item (2), since the autoencoder loss encourages subsequences which are correct to remain unchanged. The autoencoder regularizer is given as

LAUTO(G) = Ex∼P(x)[−x log(G(x))].    (5)

As our second regularizer, we enforce that the frequency of each token in the generated output remains close to the frequency of the input tokens. This enforces item (2), with the exception that it may allow changes in the order of the sequence, e.g., arbitrary reordering does not increase this loss. However, the GAN loss alleviates this issue, since arbitrary reordering produces incorrect sequences which differ heavily from P(y).
Our second regularizer is given as

LFREQ(G) = Ex∼P(x)[ ∑_{i=0}^{n} ‖freq(x, i) − freq(G(x), i)‖₂² ]    (6)

where n is the size of the vocabulary and freq(x, i) is the frequency of the ith token in x.

4 Putting It All Together - Proposed GAN Framework
The generator in our network consists of a standard NMT system with an attention mechanism similar to that of Luong et al. [34]. For all experiments the encoder and decoder consist of multi-layer RNNs utilizing Long Short-Term Memory (LSTM) units [35]. We use a dot-product attention mechanism as per [34]. We use convolution-based discriminators, since they have been shown to be easier to train and to generally perform better than RNN-based discriminators [26]. Additional network details are provided in the Supplementary Material.
We have two different regularized loss models, given as

L(D, G) = LWGAN(D, G) + λLAUTO(G)    (7)
L(D, G) = LWGAN(D, G) + λLFREQ(G)    (8)

where LAUTO(G) and LFREQ(G) are defined in Section 3.3. We also experiment with the unregularized base loss model, where we set λ = 0.

4.1 Autoencoder Pre-Training
We rely heavily on pre-training to give our GAN a good starting point. Our generators are pre-trained as de-noising autoencoders on the desired data [36]. Specifically, we train the generator with the loss function:

LAUTO_PRE(G) = Ey∼P(y)[−y log(G(ŷ))]    (9)

where ŷ is the noisy version of the input created by dropping tokens in y with probability 0.2 and randomly inserting and deleting n tokens, where n is 0.03 times the sequence length. These numbers were selected based on hyperparameter tuning.

4.2 Curriculum Learning
Likelihood-based methods for training seq2seq networks often utilize teacher forcing during training, where the input to the decoder is forced to be the desired value regardless of what was generated at the previous time step [37].
This allows stable training of very long sequence lengths even at the start of training. Adversarial methods cannot use teacher forcing, since the desired sequence is unknown, and must therefore always pass a sample of st−1 as the input to time t. This can lead to unstable training, since errors early in the output will be propagated forward, potentially creating gibberish in the latter parts of the sequence. To avoid this problem we adopt a curriculum learning strategy where we incrementally increase the length of produced sequences throughout training. Instead of selecting subsets of the data for curriculum training, we clip all sequences to have a predefined maximum length for each curriculum step. Although this approach relies on the discriminator being able to handle incomplete sentences, it does not degrade the performance as long as the discriminator is briefly retrained after each curriculum update.

5 Experiments

GAN methods have often been criticized for their lack of easy evaluation metrics. Therefore, we focus our experiments on datasets which contain paired examples. This enables us to meaningfully evaluate the performance of our approach, even though our GAN approach does not require pairs to train. These datasets also allow us to train seq2seq networks and use their performance as an upper bound to our GAN based approach. We start our experiments by exploring two hand-curated datasets, namely sequences of sorted numbers and a Context Free Grammar (CFG), which help highlight the benefits of our proposed GAN approach to address the domain mapping problem. We then investigate the harder problem of correcting errors in C/C++ code. All of our results are given in Table 1.

5.1 Sorting
In order to show the necessity of enforcing accurate domain mapping, we generate a dataset where the repair task is to sort the input into ascending order.
We generate sequences of 20 randomly selected integers (without replacement) between 0 and 50 in ascending order. We then inject errors by swapping n selected tokens which are next to each other, where n is a (rounded) Gaussian random variable with mean 8 and standard deviation 4. The task is then to sort the sequence back into its original ascending order given the error injected sequence. This scheme of data generation allows us to maintain pairs of good (before error injection) and bad (after error injection) data, and to compute the accuracy with which our GAN is able to restore the good sequences from the bad. We refer to this accuracy as 'Sequence Accuracy' (or Seq. Acc.). In order to assess our domain mapping approach and evaluate the usefulness of our self-regularizer loss functions defined in Section 3.3, we also compute the percentage of sequences which have valid orderings but not necessarily valid domain mappings, which we refer to as 'Order Accuracy' (or Order Acc.).
It is clear from the results in Table 1 that the vanilla (base) GAN easily learns to generate sequences with valid ordering, without necessarily paying attention to the input sequence. This leads to high Order Accuracy, but low Sequence Accuracy. However, adding the Auto or Freq loss regularizers, as in (7) and (8), significantly improves the Seq. Acc., which shows that these losses do effectively enforce correct mapping between source and target domains.

5.2 Simple Grammar
For our second experiment, we generate data from a simple Context Free Grammar similar to that used by Rajeswar et al. [24]. The specifics of the CFG are provided in the Supplementary Material. Our good data is selected randomly from the set of all sequences which satisfy the grammar and are less than length 20.
We then inject errors into each sequence, where the number of errors is chosen as a Gaussian random variable (zero thresholded and rounded) with mean 5 and standard deviation 2. Each error is then randomly chosen to be either a deletion of a random token, an insertion of a random token, or a swap of two random tokens.
The network is tasked with generating the original sequence from the error injected one. This task better models real data than the sorting task above, because each generated token must follow the grammar and is therefore conditioned on all previous tokens. The results in Table 1 show that our proposed GAN approach is able to achieve high CFG accuracy, in terms of generating correct sequences that fit the CFG. In addition to CFG accuracy, we also compute BLEU scores based on the pairs before and after error injection. We should note that our random error injection process results in many bad examples corresponding to a specific good example or vice versa, i.e., mappings are not bijective. Having multiple bad examples in the dataset paired with the same good example contributes to the slightly lower BLEU scores, since the network can only map each bad input to a single output. This issue appears frequently in real world repair datasets, since code sequences can be repaired or broken multiple different ways. Our GAN approach performs well on this CFG dataset, suggesting that it can handle this issue, for which cycle approaches are not appropriate [28, 29, 27].

Table 1: Results on all experiments. Cur refers to experiments using curriculum learning, while Auto and Freq are those using LAUTO and LFREQ, respectively. Sate4-P and Sate4-U denote paired and unpaired datasets, respectively.

Model                 Sorting               CFG                  Sate4-P  Sate4-U
                      Seq Acc.  Order Acc.  BLEU-4   CFG Acc.    BLEU-4   BLEU-4
seq2seq
  Base                99.7      99.8        91.3     99.3        96.3     N/A
  Base + Cur          99.7      99.8        90.2     98.9        96.4     N/A
Proposed GAN
  Base                82.8      96.9        88.5     98.0        84.2     79.3
  Base + Auto         98.9      99.6        88.6     96.5        85.7     79.2
  Base + Freq         99.3      99.7        88.3     97.5        86.2     79.5
  Base + Cur          81.5      98.0        88.4     98.9        88.3     81.1
  Base + Cur + Auto   96.2      98.0        88.5     97.8        89.9     81.5
  Base + Cur + Freq   98.2      99.1        88.6     96.3        90.3     81.3

5.3 SATE IV
SATE IV is a dataset which contains C/C++ synthetic code examples (functions) with vulnerabilities from 116 different Common Weakness Enumeration (CWE) classes, and was originally designed to explore the performance of static and dynamic analyzers [38]. Each bad function contains a specific vulnerability, and is paired with several candidate repairs. There is a total of 117,738 functions, of which 41,171 contain a vulnerability and 76,567 do not. We lex each function using our custom lexer. After lexing, each function ranges in length from 10 to 300 tokens.
Using this data, we created two datasets to perform two different experiments, namely paired and unpaired datasets. The paired dataset allows us to compare the performance of our GAN approach with a seq2seq approach. In order to have a dataset which is fair for both GAN and seq2seq training, we created paired data by taking each example of vulnerable code and sampling one of its repairs randomly. We iterate this process through the dataset four times, pairing each vulnerable function with a sampled repair, and combine the resulting sets into a single large dataset. We should mention that although the paired dataset includes labeled pairs, those labels are not utilized for GAN training.
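This pairing procedure can be sketched as follows. The miniature dataset, function names, and seed below are hypothetical; only the sample-one-repair-per-pass scheme reflects the construction described above:

```python
import random

def build_paired_dataset(vulnerable_to_repairs, passes=4, seed=0):
    """Pair each vulnerable function with one randomly sampled candidate
    repair; repeat for `passes` passes and pool the results."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(passes):
        for bad, repairs in vulnerable_to_repairs.items():
            pairs.append((bad, rng.choice(repairs)))
    return pairs

# Hypothetical miniature dataset: two "functions" with candidate repairs.
data = {"bad_fn_a": ["fix_a1", "fix_a2"], "bad_fn_b": ["fix_b1"]}
pairs = build_paired_dataset(data)
# Four passes over two functions yield eight (bad, repair) pairs.
```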
For the unpaired dataset, we wanted\nto guarantee that a given source sequence does not have a corresponding target sequence\nanywhere in the training data. To achieve this, we divided the data into two disjoint sets by\nplacing either a vulnerable function or its candidate repairs into the training dataset with\nequal probability. Note that this operation reduces the size of our training data by half.\nFor testing, we compute BLEU scores using all of the candidate repairs for each vulnerable\nfunction. We use a 80/10/10% train/validation/test split.\nAs shown in Table 1, our proposed GAN approach achieves progressively better results when\nwe add (a) curriculum training, and (b) either LAUTO or LFREQ regularization loss. The\nBase + Cur + Freq model proves to be the best among di\ufb00erent GAN models, and performs\nreasonably close to the seq2seq baseline, which is the upper performance bound. The results\non the unpaired dataset are fairly close to those achieved by the paired dataset, particularly\nin the Base case, even though they are obtained with only half of the training data. Some\ncode examples where our GAN makes correct repairs are provided in Table 2, with additional\nexamples in the Supplementary Material.\n\n6 Conclusions\n\nWe have proposed a GAN based approach to train an NMT system for discrete domain\nmapping applications. The major advantage of our approach is that it can be used in the\nabsence of paired data, opening up a wide set of previously unusable data sets for the\n\n8\n\n\fTable 2: Successful Repairs: (Top) This function calls sprintf to print out two strings, but\nonly provides the \ufb01rst string to print. Our GAN repairs it by providing a second string.\n(Bottom) This function uses a variable again after freeing it. 
Our GAN repairs it by removing the first free.

With Vulnerability:

    void CWE685_Function_Call_With_Incorrect_Number_Of_Arguments() {
        char dst[DST_SZ];
        sprintf(dst, "%s %s", SRC_STR);
        printLine(dst);
    }

Repaired:

    void CWE685_Function_Call_With_Incorrect_Number_Of_Arguments() {
        char dst[DST_SZ];
        sprintf(dst, "%s %s", SRC_STR, SRC_STR);
        printLine(dst);
    }

With Vulnerability:

    void CWE415_Double_Free__malloc_free_struct_31() {
        twoints *data;
        data = NULL;
        data = (twoints *)malloc(100 * sizeof(twoints));
        free(data);
        {
            twoints *data_copy = data;
            twoints *data = data_copy;
            free(data);
        }
    }

Repaired:

    void CWE415_Double_Free__malloc_free_struct_31() {
        twoints *data;
        data = NULL;
        data = (twoints *)malloc(100 * sizeof(twoints));
        {
            twoints *data_copy = data;
            twoints *data = data_copy;
            free(data);
        }
    }

sequence correction task. Key to our approach is the addition of two novel generator loss functions that enforce accurate domain mapping without requiring multiple generators or bijective domains. We have also discussed, and made some progress toward, handling discrete outputs with GANs. We note, however, that this problem is far from solved and will likely inspire further research. Although we only apply our approach to the problem of source code correction, it is applicable to other sequence correction problems, such as grammatical error correction or language sentiment translation, e.g., converting negative reviews into positive ones.

Acknowledgments
This project was sponsored by the Air Force Research Laboratory (AFRL) as part of the DARPA MUSE program.

References
[1] MITRE. Common Vulnerabilities and Exposures. cve.mitre.org.
[2] T. D. LaToza, G. Venolia, and R. DeLine. Maintaining mental models: A study of developer work habits. International Conference on Software Engineering (ICSE), 2006.
[3] J. Ji, Q. Wang, K. Toutanova, Y. Gong, S.
Truong, and J. Gao. A Nested Attention Neural Hybrid Model for Grammatical Error Correction. Annual Meeting of the Association for Computational Linguistics (ACL), pages 753–762, 2017.
[4] Z. Yuan and T. Briscoe. Grammatical error correction using neural machine translation. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2016.
[5] A. Schmaltz, Y. Kim, A. M. Rush, and S. M. Shieber. Adapting Sequence Models for Sentence Correction. Empirical Methods in Natural Language Processing (EMNLP), 2017.
[6] Z. Xie, A. Avati, N. Arivazhagan, D. Jurafsky, and A. Y. Ng. Neural Language Correction with Character-Based Attention. arXiv:1603.09727, March 2016.
[7] J. A. Harer, L. Y. Kim, R. L. Russell, O. Ozdemir, L. R. Kosta, A. Rangamani, L. H. Hamilton, G. I. Centeno, J. R. Key, P. M. Ellingwood, M. W. McConley, J. M. Opper, P. Chin, and T. Lazovich. Automated software vulnerability detection with machine learning. arXiv:1803.04497, February 2018.
[8] Fourth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), 2017.
[9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. Neural Information Processing Systems (NIPS), June 2014.
[10] M. Monperrus. Automatic software repair: A bibliography. ACM Computing Surveys (CSUR), 51(1):17:1–17:24, January 2018.
[11] X. B. D. Le, D. Lo, and C. Le Goues. History driven program repair. Software Analysis, Evolution, and Reengineering (SANER), 2016.
[12] F. Long and M. Rinard. Automatic patch generation by learning correct code. Principles of Programming Languages (POPL), 2016.
[13] J. Devlin, J. Uesato, R. Singh, and P. Kohli. Semantic Code Repair using Neuro-Symbolic Transformation Networks. arXiv:1710.11054, October 2017.
[14] R. Gupta, S. Pal, A. Kanade, and S. Shevade.
DeepFix: Fixing common C language errors by deep learning. Association for the Advancement of Artificial Intelligence (AAAI), pages 1345–1351, 2017.
[15] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. International Conference on Learning Representations (ICLR), 2015.
[16] M. Arjovsky and L. Bottou. Towards Principled Methods for Training Generative Adversarial Networks. International Conference on Learning Representations (ICLR), 2017.
[17] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein Generative Adversarial Networks. International Conference on Machine Learning (ICML), 2017.
[18] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Neural Information Processing Systems (NIPS), 2016.
[19] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations (ICLR), 2016.
[21] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. Association for the Advancement of Artificial Intelligence (AAAI), 2017.
[22] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin. Adversarial Feature Matching for Text Generation. International Conference on Machine Learning (ICML), 2017.
[23] O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf. Language Generation with Recurrent Generative Adversarial Networks without Pre-training. 1st Workshop on Subword and Character Level Models in NLP (SCLeM), 2017.
[24] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A.
Courville. Adversarial Generation of Natural Language. 2nd Workshop on Representation Learning for NLP (RepL4NLP), 2017.
[25] M. Mirza and S. Osindero. Conditional Generative Adversarial Nets. arXiv:1411.1784, November 2014.
[26] Z. Yang, W. Chen, F. Wang, and B. Xu. Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets. North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
[27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. International Conference on Computer Vision (ICCV), 2017.
[28] A. N. Gomez, S. Huang, I. Zhang, B. M. Li, M. Osama, and L. Kaiser. Unsupervised Cipher Cracking Using Discrete GANs. International Conference on Learning Representations (ICLR), 2018.
[29] G. Lample, L. Denoyer, and M. Ranzato. Unsupervised Machine Translation Using Monolingual Corpora Only. International Conference on Learning Representations (ICLR), 2018.
[30] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. Computer Vision and Pattern Recognition (CVPR), 2017.
[31] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved Training of Wasserstein GANs. Neural Information Processing Systems (NIPS), 2017.
[32] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. International Conference on Learning Representations (ICLR), 2017.
[33] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. International Conference on Learning Representations (ICLR), 2017.
[34] M.-T. Luong, H. Pham, and C. D. Manning. Effective Approaches to Attention-based Neural Machine Translation. Empirical Methods in Natural Language Processing (EMNLP), 2015.
[35] S.
Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, December 1997.
[36] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. International Conference on Machine Learning (ICML), 2008.
[37] R. J. Williams and D. Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1989.
[38] V. Okun, A. Delaitre, and P. E. Black. Report on the Static Analysis Tool Exposition (SATE) IV. Technical report, 2013.