{"title": "SPoC: Search-based Pseudocode to Code", "book": "Advances in Neural Information Processing Systems", "page_first": 11906, "page_last": 11917, "abstract": "We consider the task of mapping pseudocode to executable code, assuming a one-to-one correspondence between lines of pseudocode and lines of code. Given test cases as a mechanism to validate programs, we search over the space of possible translations of the pseudocode to find a program that compiles and passes the test cases. While performing a best-first search, compilation errors constitute 88.7% of program failures. To better guide this search, we learn to predict the line of the program responsible for the failure and focus search over alternative translations of the pseudocode for that line.  For evaluation, we collected the SPoC dataset (Search-based Pseudocode to Code) containing 18,356 C++ programs with human-authored pseudocode and test cases. Under a budget of 100 program compilations, performing search improves the synthesis success rate over using the top-one translation of the pseudocode from 25.6% to 44.7%.", "full_text": "SPoC: Search-based Pseudocode to Code\n\nSumith Kulal\u2217, Panupong Pasupat\u2217, Kartik Chandra, Mina Lee,\n\nOded Padon, Alex Aiken, Percy Liang\n\nDepartment of Computer Science\n\n{sumith,ppasupat,kach,minalee,padon,aaiken,pliang}@cs.stanford.edu\n\nStanford University\n\nAbstract\n\nWe consider the task of mapping pseudocode to executable code, assuming a one-\nto-one correspondence between lines of pseudocode and lines of code. Given test\ncases as a mechanism to validate programs, we search over the space of possible\ntranslations of the pseudocode to \ufb01nd a program that compiles and passes the test\ncases. While performing a best-\ufb01rst search, compilation errors constitute 88.7%\nof program failures. To better guide this search, we learn to predict the line of the\nprogram responsible for the failure and focus search over alternative translations\nof the pseudocode for that line. For evaluation, we collected the SPoC dataset\n(Search-based Pseudocode to Code) containing 18,356 C++ programs with human-\nauthored pseudocode and test cases. Under a budget of 100 program compilations,\nperforming search improves the synthesis success rate over using the top-one\ntranslation of the pseudocode from 25.6% to 44.7%.\n\n1\n\nIntroduction\n\nWe consider the task of mapping natural language pseudocode to functionally correct computer\nprograms that are long enough to have signi\ufb01cant intermediate state (e.g., 10\u201320 lines) and perform\nnon-trivial computations. Previous work on executable semantic parsing mainly focuses on translating\nshort text descriptions to one-line programs [57, 47, 58, 59, 29, 11], and while recent work explored\ngenerating longer programs from text descriptions [30, 54, 39, 18, 19, 17], these programs are mostly\nevaluated on syntactic metrics (e.g., exact match and BLEU score) rather than functional correctness.\nIn contrast, the program synthesis community emphasizes functional correctness, typically captured\nby a set of input-output test cases that the program must compute correctly [14, 13]. However,\ninput-output pairs give no information about the intermediate states of the program, making it dif\ufb01cult\nto synthesize long programs.\nSynthesizing a general class of programs of signi\ufb01cant length and internal complexity is too chal-\nlenging without some description of the steps of computation. 
To that end, we propose synthesizing programs from natural language pseudocode and test cases. The test cases provide the functional specification, while the pseudocode provides guidance for the intermediate computations the program should perform.

To synthesize a functionally correct program, instead of relying on the top-one translation of the pseudocode, we search over the space of possible translations to find one that passes the test cases. In this work, we view the desired program as a composition of segments of code, each aligned to a segment of natural language in the pseudocode. Figure 1 instantiates our setup: each pseudocode line translates to a line of code with approximately one or two atomic statements. Unlike treating the whole program as a single big chunk (too coarse) or decomposing into individual tokens (too fine-grained), semantically coherent segments of code (rendered as single lines here) form a natural abstraction that can still be described precisely and succinctly by natural language. Within this framework, a program can be synthesized by choosing a candidate translation for each pseudocode line. We can now search for a combination of code lines that form a program passing the test cases.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

 i   xi (pseudocode)                       yi (code)
 1   in function main                      int main() {
 2   let n be integer                      int n;
 3   read n                                cin >> n;
 4   let A be vector of integers           vector<int> A;
 5   set size of A = n                     A.resize(n);
 6   read n elements into A                for(int i = 0; i < A.size(); i++) cin >> A[i];
 7   for all elements in A                 for(int i = 0; i < A.size(); i++) {
 8   set min_i to i                        int min_i = i;
 9   for j = i + 1 to size of A exclusive  for(int j = i+1; j < A.size(); j++) {
10   set min_i to j if A[min_i] > A[j]     if(A[min_i] > A[j]) { min_i = j; }
11   swap A[i], A[min_i]                   swap(A[i], A[min_i]);
12   print all elements of A               for(int i=0; i<A.size(); i++) cout<<A[i]<<\" \";
                                           }

Public test case 1 (out of 5):  5 3 2 4 1 5        → 1 2 3 4 5
Hidden test case 1 (out of 8):  8 9 2 4 5 6 2 7 1  → 1 2 2 4 5 6 7 9

Figure 1: Given L pseudocode lines x1:L (with indentation levels ℓ1:L) and public test cases, our task is to synthesize a program with code lines y1:L. The program is evaluated against both public and hidden test cases.

However, common search methods for language-to-code tasks (e.g., beam search [56]) only use the sparse reward of whether the program succeeds to navigate the search space. Without performing credit assignment to pinpoint the causes of program failures, it is difficult to guide search toward promising programs. Since 88.7% of failures during search are due to compilation errors, we propose to perform credit assignment based on information returned from the compiler: when a program fails to compile, we use error localization methods to identify which line of the program is responsible for the failure, and then focus the search on alternative translations of the pseudocode for that line.

We propose two error localization methods. The first uses a multiclass classifier to predict one of the code lines as the offending line, which is then down-weighted in subsequent search iterations. In contrast to previous error correction models [15], our model also uses the error message and pseudocode for prediction. 
This is crucial when the compilation error can be fixed in multiple ways, only some of which are consistent with the pseudocode. The second method, prefix-based pruning, uses additional compilations to find a code prefix that causes the error. Unlike the classifier, the identified code prefix is guaranteed to be erroneous and can be blacklisted entirely.

For evaluation, we collected and release a new dataset, SPoC (Search-based Pseudocode to Code), containing 18,356 C++ programs (14.7 lines on average). In contrast to other language-to-code datasets [30, 36, 19], all programs contain multiple test cases for validation. In contrast to the closely related NAPS dataset [56], which also contains test cases but only 6% human-authored pseudocode, all programs in SPoC have human-authored pseudocode of a consistent annotation granularity. Section 3 details the comparison between SPoC and related datasets.

Using the top-one translation of the pseudocode yields a success rate of 24.6% on the test set. Under a limited budget of 100 synthesis trials (i.e., 100 code compilations and executions), our best method achieves a success rate of 44.7%. The multiclass error localization model reduces the number of synthesis trials needed in 15.5% of the programs, with a median absolute reduction of 26 trials and a median relative reduction of 42%. On the other hand, prefix-based pruning slightly increases the number of compilations for easier problems, but is more effective on harder programs, making it outperform the multiclass classifier under larger budgets.

2 Problem statement

Figure 1 illustrates the setup of our synthesis task. Given (a) a sequence x of L pseudocode lines x1, x2, ..., xL and (b) k public test cases in the form of input-output string pairs (T1^in, T1^out), ..., (Tk^in, Tk^out), the task is to synthesize a program y consisting of L code lines y1, y2, ..., yL. Each code line yi is a fragment of code whose semantics is described by the pseudocode line xi. To simplify the task, the indentation level ℓi of each line is also given; we leave the process of predicting ℓi from the input to future work.

The synthesized program is accepted if it successfully compiles and passes all public test cases (i.e., the compiled executable prints the string Ti^out after reading the input Ti^in) as well as k' additional hidden test cases (T̃1^in, T̃1^out), ..., (T̃k'^in, T̃k'^out).

For training, we are given a set of examples where each example contains pseudocode x, a gold program y, and both public and hidden test cases. At test time, the system has access to pseudocode x, public test cases (not hidden ones), and a computation budget. For a fair comparison under different computing environments, we use the number of synthesis trials as the budget, where in each trial, the system can issue a single call to the compiler and execute the compiled program on public test cases. The system must output a single final program, which will be validated on both public and hidden test cases.
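To make the trial protocol concrete, the following is a minimal Python sketch of what one synthesis trial might look like: a single compiler call, followed by executing the binary on the public test cases. The function name, compiler invocation, paths, timeout, and whitespace-normalized output comparison are illustrative assumptions, not the paper's actual evaluation harness.

import os
import subprocess
import tempfile

def run_trial(program_lines, tests, timeout=2.0):
    """One synthesis trial: one compiler call plus execution of the
    compiled binary on the given test cases. Returns True iff the
    program compiles and prints the expected output for every test.
    (A sketch of the protocol under assumed paths and flags.)"""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "prog.cpp")
        exe = os.path.join(tmp, "prog")
        with open(src, "w") as f:
            f.write("\n".join(program_lines))
        # the single compiler call allowed in this trial
        if subprocess.run(["g++", "-O2", "-o", exe, src],
                          capture_output=True).returncode != 0:
            return False  # compilation error (88.7% of failures in search)
        for t_in, t_out in tests:
            try:
                out = subprocess.run([exe], input=t_in, text=True,
                                     capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False  # runtime failure
            # compare token-by-token to ignore whitespace differences
            if out.returncode != 0 or out.stdout.split() != t_out.split():
                return False  # runtime error or wrong answer
        return True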
3 Dataset

Recall that our main goal is to synthesize functionally correct programs of significant length and complexity. To this end, we argue that it is important to have both a description of the intermediate computation and a functional specification of the program. While semantic parsing datasets with shorter, query-like programs [57, 7, 33, 4, 59, 23, 44, 55] usually have an execution environment (e.g., a database) to validate the programs, Table 1 shows that most existing language-to-code datasets with longer programs [30, 36, 19] lack mechanisms to validate the correctness of programs. This inevitably leads previous work to resort to proxy metrics, such as exact match accuracy, BLEU score, and tree node F1 score, which only measure syntactic similarity rather than functional correctness [30, 54, 39, 18, 19, 17].

One notable exception and the inspiration for our work is the NAPS dataset [56], which contains both a description (pseudocode) and a functional specification (test cases) of competitive programming problems. However, most of its pseudocode in the training data (4,923,000 examples from 16,410 programs) is generated by heuristic rule-based templates, which is less realistic compared to the human-authored counterpart (2,231 examples from 582 programs). Furthermore, the descriptions suffer from inconsistent granularity: the artificial pseudocode is fine-grained (e.g., "increase var0 by 1") whereas the human-written pseudocode tends to be abstract (e.g., "compute factorial") as the annotators were encouraged to provide high-level descriptions. This discrepancy is reflected by the ratio of the length of pseudocode to that of code, which is 1:1.82 in the artificial dataset and 1:3.26 in the human-authored dataset.

As no existing dataset contains both high-quality human-authored descriptions with a consistent level of granularity and a mechanism to validate functional correctness, we created a new dataset called SPoC (Search-based Pseudocode to Code), which consists of programs, pseudocode, and test cases. The programs are non-trivial solutions to competitive programming problems, and each program is paired with public and hidden test cases. To control the quality and granularity of the pseudocode, instead of collecting free-form descriptions, we segmented the code in a consistent fashion and collected natural language pseudocode for each segment from curated crowdworkers.

3.1 Data collection

Programs and test cases. Similar to the NAPS dataset [56], we scraped competitive programming problems and their test cases from codeforces.com. Each problem has multiple programs submitted by participants as solutions to the problem. We collected accepted C++ programs from problems marked as the easiest level based on their metric. Based on our pilot study, we filtered out programs with constructs that are difficult to consistently annotate with pseudocode (i.e., programs with #define macros, classes, structs, templates, switch statements, and mallocs).
Table 1: Datasets for natural language to code. In contrast to other datasets, our SPoC dataset contains human-authored pseudocode with a consistent granularity of description and test cases.

                                   MTG [30]         HS [30]          DJANGO [36,30]  CONCODE(1) [19]   NAPS(2) [56]  SPoC
  Programming language             Java             Python           Python          Java              UAST          C++
  Number of programs (total)       13,297           665              18,805          2,184,310         17,477        18,356
  Lines per program (average)      30.4             7.7              1               4.4               21.7          14.7
  Type of natural language input   card text        card text        comment         documentation     pseudocode    pseudocode
  Additional input                 card metadata    card metadata    -               class context     -             -
  Granularity of text description  program (class)  program (class)  line            program (method)  varies        line
  Fraction of human-annotated text 100%             100%             100%            100%              6%            100%
  Number of annotators (total)     n/a              n/a              1               n/a               n/a           59
  Test cases                       no               no               no              no                yes           yes
  Number of test cases (average)   -                -                -               -                 7.5           38.6

(1) We counted the number of programs in the released dataset. Since the programs are provided as a sequence of tokens, the number of lines per program is approximated based on the number of ;, {, and }.
(2) We excluded partial programs (smaller pieces of full programs) in the dataset when counting.

Table 2: Examples of complex code lines and pseudocode lines in the SPoC dataset. Each pseudocode line is followed by its code line.

  read n values into array a and array b
      for(int i = 0; i < n; i++) cin >> a[i] >> b[i];
  read n and m in a loop, printing n*m/2 and a new line on each iteration
      while (cin >> n >> m) cout << n * m / 2 << endl;
  print all elements of ans
      for (int i = 0; i < ans.size(); i++) cout << ans[i];
  change max to i if tree[i] > tree[max] or max otherwise
      max = tree[i] > tree[max] ? i : max;
  if m and n are odd
      if (m % 2 != 0 && n % 2 != 0)
  if a is a digit return 1
      if (a >= '0' && a <= '9') return 1;
  add s to q (q is a set)
      q.insert(s);
  add ok to ans (ans is an integer)
      ans += ok;
  add element a to v (v is a vector)
      v.push_back(a);

Decomposition. We decompose each program into code lines. To obtain slightly higher-level descriptions for common constructs, we group any block with only one statement with the preceding control statement (e.g., the one-line for loop \"for (int i = 0; i < n; i++) cin >> x[i];\" allows a high-level description \"read n values into x\").

Pseudocode. We recruited 59 crowdworkers on Amazon Mechanical Turk to write pseudocode for each line of code. The workers can see the whole program and are encouraged to vary the sentence structure while still maintaining semantic correctness. To our pleasant surprise, we were able to recruit a set of workers (as opposed to curated specialists) capable of annotating C++ code via a qualification round in which we manually inspected their annotations.

Statistics. Our dataset contains 18,356 programs submitted for 677 programming problems. Each problem has roughly 27 programs, which are likely to have similar semantics yet different code syntax. Excluding closing braces and the common \"int main()\" line, each program contains an average of 14.7 lines (with a minimum of 1 and a maximum of 457 lines of code). The average length of code lines is 9.08 tokens, while the average length of pseudocode lines is 7.86 tokens.

While many code lines in the dataset are simple statements such as assignment (\"i++;\") or input reading (\"cin >> n;\"), a significant number of pseudocode lines are non-trivial to translate. As illustrated in Table 2, some code lines contain multiple atomic statements, in which case the pseudocode tends to be higher-level. 
Even single-statement lines may require understanding complex sentence structures, idiomatic expressions, or the context from other parts of the program.

Training and test sets. To evaluate generalization to unseen problems and annotation styles, we created two test sets. We generated the first test set TESTP by splitting based on problems: we held out 158 problems (23% of the 677 problems), which is equivalent to 1,820 programs (10.1% of all programs). The second test set TESTW is split by workers: we held out 7 workers (12% of the 59 workers), which is equivalent to 1,752 programs (9.7% of all programs, with 186 programs overlapping with TESTP). We used the remaining data for training and development (90:10 split).

Figure 2: Illustration of best-first search and the error localization model. In this example, (c11, c22, c32) satisfies the test cases. Best-first search iterates in the order of decreasing probabilities and succeeds in four compiler calls. The error localization method down-weights c21, leading to an earlier success.

4 Basic Approach

As illustrated in Figure 2, our starting point for synthesizing a program y1:L from pseudocode x1:L and public test cases involves two steps. First, a translation model takes each pseudocode line xi as input and generates M candidate code lines ci1, ..., ciM to be used as the i-th code line. Then, we search over the possible combinations of candidate translations until we find a program ŷ1:L that successfully compiles and passes all public test cases.

Translation. To generate candidate code lines, we use the standard seq2seq implementation from OpenNMT [24] with an LSTM encoder and decoder, attention-based copying mechanism [32, 49], and coverage vector [46]. After encoding the pseudocode line xi, we apply beam search with beam size M to produce a ranked list of candidate translations Ci = (ci1, ..., ciM), where each code line cij is a sequence of string tokens. (We use M = 100 for our experiments.) The model also assigns a probability pij = p(cij | xi) to each candidate cij. The translation model is trained on pairs (xi, yi) from the training data using the standard log-likelihood objective.

Best-first search. Given the candidate lists C1, ..., CL, we can synthesize a program ŷ by picking a candidate ci,j[i] from each Ci (where j[i] ∈ {1, ..., M}) and then concatenating them into a program. In our search algorithm, we iterate through programs ŷ in descending order of probability p(ŷ) = ∏_{i=1}^{L} pi,j[i]. To do so, we maintain a heap of the combinations ŷ = (c1,j[1], ..., cL,j[L]) indexed by p(ŷ). The heap initially contains the program (c11, ..., cL1), which is the top-one translation of the pseudocode. In each iteration, we pop a program (c1,j[1], ..., cL,j[L]) from the heap and test it. If the program fails (either from a compilation error, a runtime error, or a mismatch between the actual and expected test outputs), we push the modified programs (c1,j[1], ..., ci,j[i]+1, ..., cL,j[L]) for all i ∈ {1, ..., L} that have not been explored onto the heap. We continue searching until we either find a program that passes all public test cases or exhaust the computation budget.
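As a concrete illustration of this procedure, here is a minimal Python sketch of the best-first search loop. It is not the released implementation: the candidate lists and probabilities are assumed to come from the translation model, and compile_and_test stands in for a trial harness such as the run_trial sketch in Section 2.

import heapq
import math

def best_first_search(candidates, probs, compile_and_test, budget):
    """candidates[i][j]: j-th candidate translation of pseudocode line i
    (0-indexed); probs[i][j]: its model probability; compile_and_test
    returns True iff the program compiles and passes all public tests."""
    L = len(candidates)

    def neg_log_p(j):  # heap key: -log p(y-hat) = -sum_i log p_{i,j[i]}
        return -sum(math.log(probs[i][j[i]]) for i in range(L))

    start = (0,) * L  # top-one translation of every line
    heap = [(neg_log_p(start), start)]
    seen = {start}
    for _ in range(budget):  # each pop costs one trial (compile + run)
        if not heap:
            break
        _, j = heapq.heappop(heap)
        program = [candidates[i][j[i]] for i in range(L)]
        if compile_and_test(program):
            return program  # accepted: compiles and passes public tests
        for i in range(L):  # push all unexplored one-step successors
            if j[i] + 1 < len(candidates[i]):
                succ = j[:i] + (j[i] + 1,) + j[i + 1:]
                if succ not in seen:
                    seen.add(succ)
                    heapq.heappush(heap, (neg_log_p(succ), succ))
    return None  # budget exhausted

On the example of Figure 2, this loop would pop (c11, c21, c31) (p = 0.084), (c11, c22, c31) (p = 0.063), (c11, c21, c32) (p = 0.056), and then (c11, c22, c32) (p = 0.042), succeeding on the fourth compiler call, which matches the figure caption.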
5 Error localization

So far, we have treated program compilation and execution as a black box that only tells us whether a program passes the public test cases. This sparse signal offers little guidance: for instance, best-first search will keep using an erroneous candidate cij if its probability pij is high.

To speed up search, we unpack the black box. In this work, we focus on compilation errors, which constitute 88.7% of the failure cases in best-first search. When a program ŷ = (c1,j[1], ..., cL,j[L]) fails to compile, the compiler reports error messages with associated line numbers. Unfortunately, the reported line numbers do not always correspond to the actual location of the mistake (e.g., the error \"'n' was not declared in this scope\" can occur long after the line where n should be declared according to the pseudocode). Empirically, the reported line number does not match the actual incorrect line 21.7% of the time.

Therefore, we treat the compilation error message as a noisy signal, and propose to use an error localization method to infer the actual portion of the code that causes the error. As illustrated in Figure 2, the error localization method has access to the pseudocode x, the synthesized code ŷ, and the first error message (ierr, merr) from the compiler, where ierr is a line number and merr is a message string. It then predicts an offending code line or abstains. We then either down-weight or blacklist the translation candidates in the offending code lines.

We now introduce two error localization methods: multiclass classification, which uses a neural model to predict a single offending line; and prefix-based pruning, which uses additional calls to the compiler to detect an erroneous code prefix.

Multiclass classification. We train a classifier to predict the offending line i* among the L lines. Our model is similar to the error correction model in [15]. For each line i, we embed the tokens of xi, yi, and merr, and then use three separate LSTMs to encode the sequences. We concatenate the final LSTM hidden states with the positional encoding [48] of the line offset ∆i = ierr − i, and then apply a feedforward network to produce the line embedding of line i. The L line embeddings are then passed through another LSTM, and the hidden state of the LSTM cell corresponding to line i is passed through a feedforward network to compute the logit for line i. We return the line i* with the highest probability (softmax over logits) if that probability exceeds a threshold βmul, and abstain otherwise. We use βmul = 0.95 for the experiments.(2)

(2) Refer to the Appendix for ablation studies of the design choices in the model.
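A minimal PyTorch sketch may help make this architecture concrete. The layer sizes, the sinusoidal positional encoding, and the single-example (unbatched) interface are our assumptions for illustration; this is not the authors' released model.

import torch
import torch.nn as nn

class ErrorLocalizer(nn.Module):
    """Scores each of the L lines as the offending line, given the
    pseudocode, the synthesized code, and the compiler message."""
    def __init__(self, vocab_size, dim):  # dim assumed even
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # three separate encoders for pseudocode, code, and error message
        self.enc_x = nn.LSTM(dim, dim, batch_first=True)
        self.enc_y = nn.LSTM(dim, dim, batch_first=True)
        self.enc_m = nn.LSTM(dim, dim, batch_first=True)
        self.pos_dim = dim
        self.line_ff = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU())
        self.over_lines = nn.LSTM(dim, dim, batch_first=True)
        self.out_ff = nn.Linear(dim, 1)

    def positional(self, offset):
        # sinusoidal encoding of the offset between reported and current line
        i = torch.arange(self.pos_dim // 2, dtype=torch.float)
        freq = 10000 ** (-2 * i / self.pos_dim)
        return torch.cat([torch.sin(offset * freq), torch.cos(offset * freq)])

    def forward(self, x_lines, y_lines, msg, i_err):
        # x_lines, y_lines: lists of L LongTensors of token ids; msg likewise
        _, (hm, _) = self.enc_m(self.embed(msg).unsqueeze(0))
        line_embs = []
        for i, (x, y) in enumerate(zip(x_lines, y_lines)):
            _, (hx, _) = self.enc_x(self.embed(x).unsqueeze(0))
            _, (hy, _) = self.enc_y(self.embed(y).unsqueeze(0))
            pos = self.positional(torch.tensor(float(i_err - i)))
            feats = torch.cat([hx[-1, 0], hy[-1, 0], hm[-1, 0], pos])
            line_embs.append(self.line_ff(feats))
        h, _ = self.over_lines(torch.stack(line_embs).unsqueeze(0))
        return self.out_ff(h[0]).squeeze(-1)  # one logit per line

At inference time one would take a softmax over the returned logits, predict i* = argmax if the maximum probability exceeds βmul = 0.95, and abstain otherwise; training maximizes the log-likelihood of the annotated offending lines.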
Given i*, we down-weight the current translation candidate of line i* so that it is used less often in subsequent search iterations. Concretely, we multiply the probability pi*,j[i*] of the current candidate ci*,j[i*] in line i* by a constant factor α < 1. As this affects the heap indices, we rebuild the heap from scratch (which takes negligible time) and continue the search, skipping any program that was already explored before the heap rebuild.

To construct a dataset for training the model, we consider each program y = y1:L in the synthesis training dataset, substitute a single line yi* with a candidate ci*j ∈ Ci* generated from the pseudocode line xi*, and then collect any modified program y' that produces a compilation error with an error message (ierr, merr). The model is trained to maximize the log-likelihood of the offending lines i*.

Prefix-based pruning. The multiclass classification method does not guarantee that the predicted line i* is actually an offending line. Furthermore, a candidate code line might be offending in some contexts but not others (e.g., a variable re-declaration is no longer offending if the previous declaration no longer exists). To address these issues, we propose an alternative that uses additional compiler calls to find an offending prefix of the program. Concretely, when a compilation error occurs, we use the compiler to find the minimum i* such that the prefix (c1,j[1], ..., ci*,j[i*]), plus closing braces to complete the program, fails to compile. Since programs containing that prefix will also fail (with very rare exceptions), we can safely blacklist the prefix from future search iterations.

Each additional compiler call is counted as one trial toward the synthesis budget. To use our budget sparingly, we only test i* = ierr − ∆i where ∆i ∈ {0, 1, 2} corresponds to the three most frequent offsets. If we fail to find an offending prefix, we simply abstain and continue the search.
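The following Python sketch illustrates one way to implement this probe. The probe order and the helper compile_prefix (assumed to append the closing braces and report whether the completed program compiles) are our assumptions, not the paper's exact procedure.

def localize_offending_prefix(candidates, j, i_err, compile_prefix):
    """Probe the three most frequent offsets of the true error from the
    reported line i_err (1-indexed). Shorter prefixes are probed first,
    so the first failing prefix found is the minimal one among the probes.
    Each compile_prefix call costs one trial from the synthesis budget."""
    for delta in (2, 1, 0):
        i_star = i_err - delta
        if i_star < 1:
            continue
        prefix = [candidates[i][j[i]] for i in range(i_star)]
        if not compile_prefix(prefix):  # prefix provably fails to compile
            return i_star               # blacklist this prefix from search
    return None  # abstain and continue best-first search as before

A returned prefix can be blacklisted outright: unlike the classifier's prediction, it is guaranteed to be erroneous, so any future combination extending it can be skipped without another compiler call.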
6 Experiments

Our main evaluation metric is the success rate at budget B: the fraction of test examples where the system generates an accepted program under the budget of B trials. For error localization methods, we also consider the reduction in the number of trials used compared to plain best-first search.

Figure 3: (a) While the translation accuracy is high at the line level, we need to consider the result at the program level. For each program, we count the number of lines i where (b) the top candidate ci1 is incorrect, and (c) none of the candidates cij ∈ Ci is correct. [Panel (a) is the table below; panels (b) and (c) are histograms over programs, summarized in the text.]

(a) Line-level accuracy (%) at ranks 1, 5, 10, and 100:

  Rank   TESTP   TESTW
  1      84.0    87.0
  5      89.1    91.8
  10     90.0    92.9
  100    92.0    94.7

Figure 4: Success rates at budgets B of best-first search with different error localization methods. [The plots show success rate (%) against budget B from 0 to 3,000; the tables give selected points.]

  TESTP (top-one at B = 1: 17.8; oracle at B = ∞: 55.2)
  B =              10     100    1000   3000
  no localization  26.5   32.5   37.5   39.1
  i* = ierr        28.0   32.5   34.3   34.7
  multiclass       28.4   34.2   38.3   39.2
  prefix-based     25.3   33.0   37.9   40.3

  TESTW (top-one at B = 1: 30.7; oracle at B = ∞: 71.4)
  B =              10     100    1000   3000
  no localization  42.5   51.0   57.3   59.4
  i* = ierr        43.5   50.0   52.6   53.1
  multiclass       44.4   53.7   58.6   60.3
  prefix-based     41.0   50.5   57.9   60.7

Translation accuracy. When evaluating the translation model, surface-form metrics such as exact sequence match and BLEU score fail to account for the functional correctness of the code. For instance, a prediction \"if (b)\" is functionally equivalent to the gold code \"if (b == true)\" when b is a boolean. Hence, we instead evaluate the functional correctness of the translation. To check whether a predicted code line cij ∈ Ci is functionally correct, we replace the code line yi in the gold program with cij and then verify whether the program still passes both public and hidden test cases.

The results in Figure 3(a) show that when the lines are considered independently, the translation model achieves a high accuracy of 84–87% under this notion of functional correctness. However, the picture is grimmer when we consider the statistics at the program level, which is what matters for synthesis. For each program, we count the number of lines i where the top candidate ci1 is not functionally correct. Figure 3(b) shows that only 18.2% of programs in TESTP and 32.0% of programs in TESTW have the top candidate correct in every line. As code lines that are functionally correct in isolation may be incompatible with one another,(3) the programs formed by combining the top candidate of each line have an even lower success rate: 17.8% on TESTP and 30.7% on TESTW.

(3) Functionally correct lines can also give an incorrect behavior when combined, but this occurs more rarely.
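Concretely, this line-level check splices a candidate into the gold program and reruns all test cases. A small sketch, reusing the run_trial harness assumed in Section 2:

def line_is_correct(gold_program, i, candidate, all_tests):
    """Functional correctness of a candidate translation for line i:
    splice it into the gold program and check that the program still
    passes both public and hidden test cases. One compile + run per check."""
    patched = list(gold_program)
    patched[i] = candidate  # i is 0-indexed here
    return run_trial(patched, all_tests)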
Oracle success rate. To compute the maximum achievable success rate given the lists of candidates, for each program, we count the number of lines i where the candidate list Ci does not contain any correct candidate. Figure 3(c) shows that 44.8% of programs in TESTP and 28.6% of programs in TESTW have at least one difficult line where the translation model does not produce a correct prediction among the top M = 100 candidates. This means a synthesizer with an infinite search budget would achieve a maximum success rate of 55.2% on TESTP and 71.4% on TESTW given our lists of candidates (assuming that incorrect candidates do not give a correct behavior when combined together).

Table 3: Effects of using error localization methods on all test examples. Differences in the number of trials are relative to plain best-first search.

  method        effect compared to best-first    count    number of trials:
                                                          absolute difference   relative difference
                                                          mean     median       geo.mean   median
  multiclass    improves number of trials        13.5%    −199.5   −26.0        ×0.39      ×0.58
                failed to synthesize → succeeds   2.0%
                worsens number of trials          0.4%    +407.5   +123.0       ×6.70      ×7.04
                succeeded → fails to synthesize   1.6%
  prefix-based  improves number of trials         4.1%    −272.6   −91.0        ×0.45      ×0.57
                failed to synthesize → succeeds   1.5%
                worsens number of trials         15.7%    +68.4    +12.0        ×1.65      ×1.63
                succeeded → fails to synthesize   0.3%

Figure 5: Examples of programs synthesized during search. In Program 1, prefix-based pruning detects that the prefix up to line 9 is offending. In Program 2, the multiclass model incorrectly predicts line 3 as the offending line, which ultimately leads to a failure.

  Program 1:
  ...
  7  let s be a string        →  string s ;
  8  read s                   →  cin >> s ;
  9  if s is half             →  if ( s / 2 == 0 )
  ...

  Program 2:
  ...
  2  create int l, p and q    →  int a , p , q ;
  3  read l, p and q          →  cin >> l >> p >> q ;
  4  print l * p / (p + q)    →  cout << l * p / ( p + q ) << endl ;
  ...

Synthesis results. Figure 4 compares the success rates of best-first search with and without error localization. As a baseline, we try down-weighting the reported error line (i* = ierr) whenever a compilation error occurs. Due to the mismatch between the actual offending line and the reported line, this deteriorates the synthesis result. Up until a compilation budget of around B = 1500, the multiclass classifier improves the success rate more than prefix-based pruning. Prefix-based pruning achieves better success rates at higher budgets, but since it uses additional compilation calls, it performs worse under tighter budgets.

Table 3 details how the error localization methods affect the synthesis outcome. The multiclass model decreases the number of trials in 15.5% of all examples, but since its predictions are not verified, the model is also more prone to catastrophic failures. Prefix-based pruning uses additional compilation calls to verify its verdict, and thus slightly worsens the number of compilations needed in a large number of examples. However, for more difficult programs, the benefit outweighs the cost.

Between the two test sets, the success rate on TESTW is consistently higher than on TESTP for all methods. One possible explanation is that the unseen problems in TESTP require the model to generalize to code constructs that are specific to solving those problems. As a piece of evidence, we found that 24.74% of the code token 4-grams in TESTP are unseen during training, while the number is only 15.77% for TESTW.

Error analysis. To understand the behavior of the error localization methods, we analyzed several examples from the development data. Some prototypical examples are shown in Figure 5. 
Program 1 shows how error localization can improve search. The condition \"s is half\" in line 9 should be translated as s == \"half\", but was instead interpreted as \"s / 2 == 0\" and \"s % 2 == 0\" with high probability, and hence best-first search spends a significant amount of its budget (1511 trials) using these incorrect candidates in combinations. In contrast, prefix-based pruning detects them as offending candidates and succeeds earlier (413 trials).

Program 2 shows how an incorrect error localization can lead to a catastrophic failure. Here, the multiclass model reads the error message merr = \"'l' was not declared in this scope\" with line number ierr = 3, and incorrectly predicts that line 3 is an offending line. This causes the search to ultimately fail, whereas best-first search finds a correct program in 80 search iterations.

7 Related work and discussion

Program synthesis. Program synthesis using test cases has been extensively studied in the programming languages literature. The most knowledge-heavy approach is to formulate synthesis as a constraint satisfaction problem [43, 45], which requires that the synthesis problem can be translated into a theory with effective constraint solvers. For other problems, brute-force enumeration of programs (with some optimization) works surprisingly well [35, 3], but when the search space is too large for enumeration, randomized search guided by a cost function can be effective [41, 42]. Some works combine aspects of multiple approaches (e.g., [21]). For program specifications, the norm is to use input-output pairs. However, most synthesized programs are relatively short, and works that consistently synthesize longer programs are in domains where the intermediate computation is easier to recover from the input and output, such as string [14, 37, 8] and data structure transformations [13, 52]. For other domains, while input-output examples are precise for evaluating functional correctness, they provide mostly global signals and little information about the intermediate computations, thereby requiring other forms of specification along with input-output examples [12].

Semantic parsing. Work on translating natural language specifications to executable programs, as discussed in Section 3, is closely related to semantic parsing, whose goal is to map natural language utterances to formal representations. One of its traditional tasks is to parse a given question (usually a single sentence) into an executable database query [57, 58, 28, 4, 59, 53]. Instead of a single query, some work aims to parse a sequence of utterances into queries that can be sequentially executed [2, 20, 31]. However, the sequences are still relatively short (e.g., at most 5 sentences).

While some semantic parsers operate by modifying the linguistic structure of the utterance [38, 40], most parsers construct the parse incrementally from scratch. Partial parses can be constructed by combining smaller partial parses in a bottom-up fashion [58, 28, 27], adding the next token to the sequence [10, 22, 51, 11, 25], or creating a new node in a tree structure [10, 26, 54, 39]. In any case, the search space of possible parses can be controlled by validating the constructed partial parses and pruning invalid ones [50, 6]. 
For instance, the parser can forbid partial parses that do not type-check, fail to execute, or execute to unfavorable values (e.g., an empty set). This validation process motivates our prefix-based pruning approach, which validates partial programs by invoking a compiler. However, we put additional emphasis on minimizing the number of compilations needed, as our initial attempt to compile every prefix in the same fashion as previous semantic parsing work [50] significantly slowed down the synthesis process.

Error localization. Error localization in the context of automated program repair has been an active topic of research. Much recent work that uses neural models to localize errors has focused on localizing and correcting syntax errors [15] or a class of well-defined semantic errors such as variable misuse and variable replacement [1, 9, 34]. Other work identifies error locations by interpreting compiler error messages [16, 5]. Likewise, our multiclass error localization model uses compilation errors to locate offending code; however, since the code is tied to pseudocode, we also use the signal from the pseudocode to distinguish ambiguous cases (e.g., in Program 2 of Figure 5, while changing either line 2 or line 3 can fix the error, a correct model should choose line 2 as the offending line with respect to the pseudocode).

Acknowledgements. We thank Shivam Garg, Jason Koenig, Nadia Polikarpova, Alex Polozov and Rishabh Singh for valuable feedback at different stages of the project. This work was supported by NSF grant CCF-1409813, NSF CAREER Award IIS-1552635 and a grant from Amazon.

Reproducibility. The dataset and code are available at https://sumith1896.github.io/spoc/. Reproducible experiments are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x23b27b2131634a158c8149d5b82adecf.

References

[1] M. Allamanis, M. Brockschmidt, and M. Khademi. Learning to represent programs with graphs. In International Conference on Learning Representations (ICLR), 2018.
[2] J. Andreas and D. Klein. Alignment-based compositional semantics for instruction following. In Empirical Methods in Natural Language Processing (EMNLP), 2015.
[3] S. Bansal and A. Aiken. Automatic generation of peephole superoptimizers. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[4] J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP), 2013.
[5] S. Bhatia, P. Kohli, and R. Singh. Neuro-symbolic program corrector for introductory programming assignments. In International Conference on Software Engineering (ICSE), 2018.
[6] M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov. Generative code modeling with graphs. In International Conference on Learning Representations (ICLR), 2019.
[7] D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Workshop on Human Language Technology, pages 43–48, 1994.
[8] J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed, and P. Kohli. RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning (ICML), 2017.
[9] J. Devlin, J. Uesato, R. Singh, and P. Kohli. Semantic code repair using neuro-symbolic transformation networks. 
arXiv preprint arXiv:1710.11054, 2017.
[10] L. Dong and M. Lapata. Language to logical form with neural attention. In Association for Computational Linguistics (ACL), 2016.
[11] L. Dong and M. Lapata. Coarse-to-fine decoding for neural semantic parsing. In Association for Computational Linguistics (ACL), 2018.
[12] Y. Feng, R. Martins, Y. Wang, I. Dillig, and T. W. Reps. Component-based synthesis for complex APIs. In Principles of Programming Languages (POPL), 2017.
[13] J. K. Feser, S. Chaudhuri, and I. Dillig. Synthesizing data structure transformations from input-output examples. In Programming Language Design and Implementation (PLDI), 2015.
[14] S. Gulwani. Automating string processing in spreadsheets using input-output examples. ACM SIGPLAN Notices, 46(1):317–330, 2011.
[15] R. Gupta, S. Pal, A. Kanade, and S. K. Shevade. DeepFix: Fixing common C language errors by deep learning. In Association for the Advancement of Artificial Intelligence (AAAI), 2017.
[16] B. Hartmann, D. MacDougall, J. Brandt, and S. R. Klemmer. What would other programmers do: suggesting solutions to error messages. In Conference on Human Factors in Computing Systems (CHI), 2010.
[17] T. Hashimoto, K. Guu, Y. Oren, and P. Liang. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[18] S. A. Hayati, R. Olivier, P. Avvaru, P. Yin, A. Tomasic, and G. Neubig. Retrieval-based neural code generation. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
[19] S. Iyer, I. Konstas, A. Cheung, and L. S. Zettlemoyer. Mapping language to code in programmatic context. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
[20] M. Iyyer, W. Yih, and M. Chang. Answering complicated question intents expressed in decomposed question sequences. CoRR, 2016.
[21] S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari. Oracle-guided component-based program synthesis. In International Conference on Software Engineering (ICSE), 2010.
[22] R. Jia and P. Liang. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL), 2016.
[23] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017.
[24] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.
[25] T. Kočiský, G. Melis, E. Grefenstette, C. Dyer, W. Ling, P. Blunsom, and K. M. Hermann. Semantic parsing with semi-supervised sequential autoencoders. In Empirical Methods in Natural Language Processing (EMNLP), pages 1078–1087, 2016.
[26] J. Krishnamurthy, P. Dasigi, and M. Gardner. Neural semantic parsing with type constraints for semi-structured tables. In Empirical Methods in Natural Language Processing (EMNLP), 2017.
[27] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. Lexical generalization in CCG grammar induction for semantic parsing. In Empirical Methods in Natural Language Processing (EMNLP), pages 1512–1523, 2011.
[28] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599, 2011.
[29] X. V. Lin, C. Wang, L. S. Zettlemoyer, and M. D. Ernst. NL2Bash: A corpus and semantic parser for natural language interface to the Linux operating system. In Language Resources and Evaluation Conference (LREC), 2018.
[30] W. Ling, E. Grefenstette, K. M. Hermann, T. Kočiský, A. Senior, F. Wang, and P. Blunsom. Latent predictor networks for code generation. In Association for Computational Linguistics (ACL), pages 599–609, 2016.
[31] R. Long, P. Pasupat, and P. Liang. Simpler context-dependent logical forms via model projections. In Association for Computational Linguistics (ACL), 2016.
[32] M. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, 2015.
[33] M. MacMahon, B. Stankiewicz, and B. Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In National Conference on Artificial Intelligence, 2006.
[34] M. Vasic, A. Kanade, P. Maniatis, D. Bieber, and R. Singh. Neural program repair by jointly learning to localize and repair. In International Conference on Learning Representations (ICLR), 2019.
[35] H. Massalin. Superoptimizer: a look at the smallest program. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1987.
[36] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura. Learning to generate pseudo-code from source code using statistical machine translation. IEEE/ACM International Conference on Automated Software Engineering (ASE), 30:574–584, 2015.
[37] E. Parisotto, A. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli. Neuro-symbolic program synthesis. In International Conference on Learning Representations (ICLR), 2017.
[38] H. Poon. Grounded unsupervised semantic parsing. In Association for Computational Linguistics (ACL), 2013.
[39] M. Rabinovich, M. Stern, and D. Klein. Abstract syntax networks for code generation and semantic parsing. In Association for Computational Linguistics (ACL), 2017.
[40] S. Reddy, O. Täckström, M. Collins, T. Kwiatkowski, D. Das, M. Steedman, and M. Lapata. Transforming dependency structures to logical forms for semantic parsing. In Association for Computational Linguistics (ACL), pages 127–140, 2016.
[41] E. Schkufza, R. Sharma, and A. Aiken. Stochastic superoptimization. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[42] K. Shi, J. Steinhardt, and P. Liang. FrAngel: Component-based synthesis with control structures. In Principles of Programming Languages (POPL), 2019.
[43] A. Solar-Lezama, L. Tancau, R. Bodik, V. Saraswat, and S. Seshia. Combinatorial sketching for finite programs. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[44] A. Talmor and J. Berant. The web as knowledge-base for answering complex questions. In North American Association for Computational Linguistics (NAACL), 2018.
[45] R. Tate, M. Stepp, Z. Tatlock, and S. Lerner. Equality saturation: a new approach to optimization. In Principles of Programming Languages (POPL), 2009.
[46] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li. Modeling coverage for neural machine translation. In Association for Computational Linguistics (ACL), 2016.
[47] D. Vadas and J. R. Curran. Programming with unrestricted natural language. 
In Australasian Language Technology Workshop (ALTA), 2005.
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[49] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 2674–2682, 2015.
[50] C. Wang, K. Tatwawadi, M. Brockschmidt, P. Huang, Y. Mao, O. Polozov, and R. Singh. Robust text-to-SQL generation with execution-guided decoding. arXiv preprint arXiv:1807.03100, 2018.
[51] C. Xiao, M. Dymetman, and C. Gardent. Sequence-based structured prediction for semantic parsing. In Association for Computational Linguistics (ACL), 2016.
[52] N. Yaghmazadeh, C. Klinger, I. Dillig, and S. Chaudhuri. Synthesizing transformations on hierarchically structured data. In Programming Language Design and Implementation (PLDI), 2016.
[53] N. Yaghmazadeh, Y. Wang, I. Dillig, and T. Dillig. SQLizer: Query synthesis from natural language. In Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2017.
[54] P. Yin and G. Neubig. A syntactic neural model for general-purpose code generation. In Association for Computational Linguistics (ACL), pages 440–450, 2017.
[55] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
[56] M. Zavershynskyi, A. Skidanov, and I. Polosukhin. NAPS: Natural program synthesis dataset. In Workshop on Neural Abstract Machines & Program Induction (NAMPI), 2018.
[57] M. Zelle and R. J. Mooney. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055, 1996.
[58] L. S. Zettlemoyer and M. Collins. Online learning of relaxed CCG grammars for parsing to logical form. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 678–687, 2007.
[59] V. Zhong, C. Xiong, and R. Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.
", "award": [], "sourceid": 6390, "authors": [{"given_name": "Sumith", "family_name": "Kulal", "institution": "Stanford University"}, {"given_name": "Panupong", "family_name": "Pasupat", "institution": "Stanford University"}, {"given_name": "Kartik", "family_name": "Chandra", "institution": "Stanford University"}, {"given_name": "Mina", "family_name": "Lee", "institution": "Stanford University"}, {"given_name": "Oded", "family_name": "Padon", "institution": "Stanford University"}, {"given_name": "Alex", "family_name": "Aiken", "institution": "Stanford University"}, {"given_name": "Percy", "family_name": "Liang", "institution": "Stanford University"}]}