{"title": "Sampling for Bayesian Program Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1297, "page_last": 1305, "abstract": "Towards learning programs from data, we introduce the problem of sampling programs from posterior distributions conditioned on that data. Within this setting, we propose an algorithm that uses a symbolic solver to efficiently sample programs. The proposal combines constraint-based program synthesis with sampling via random parity constraints. We give theoretical guarantees on how well the samples approximate the true posterior, and have empirical results showing the algorithm is efficient in practice, evaluating our approach on 22 program learning problems in the domains of text editing and computer-aided programming.", "full_text": "Sampling for Bayesian Program Learning\n\nKevin Ellis\n\nArmando Solar-Lezama\n\nBrain and Cognitive Sciences\n\nMIT\n\nCSAIL\nMIT\n\nJoshua B. Tenenbaum\n\nBrain and Cognitive Sciences\n\nMIT\n\njbt@mit.edu\n\nellisk@mit.edu\n\nasolar@csail.mit.edu\n\nAbstract\n\nTowards learning programs from data, we introduce the problem of sampling\nprograms from posterior distributions conditioned on that data. Within this setting,\nwe propose an algorithm that uses a symbolic solver to ef\ufb01ciently sample programs.\nThe proposal combines constraint-based program synthesis with sampling via\nrandom parity constraints. We give theoretical guarantees on how well the samples\napproximate the true posterior, and have empirical results showing the algorithm is\nef\ufb01cient in practice, evaluating our approach on 22 program learning problems in\nthe domains of text editing and computer-aided programming.\n\n1\n\nIntroduction\n\nLearning programs from examples is a central problem in arti\ufb01cial intelligence, and many recent\napproaches draw on techniques from machine learning. 
Connectionist approaches, like the Neural Turing Machine [1, 2], and symbolic approaches, like Hierarchical Bayesian Program Learning [3, 4, 5], couple a probabilistic learning framework with either gradient- or sampling-based search procedures. In this work, we consider the problem of Bayesian inference over program spaces. We combine solver-based program synthesis [6] and sampling via random projections [7], showing how to sample from posterior distributions over programs where the samples come from a distribution provably arbitrarily close to the true posterior. The new approach is implemented in a system called PROGRAMSAMPLE and evaluated on a set of program induction problems that include list and string manipulation routines.

1.1 Motivation and problem statement

Figure 1 (input/output pair and sampled programs):
Input "1/21/2001" → Output "01"
substr(pos('0',-1),-1)
const('01')
substr(-2,-1)

Consider the problem of learning string edit programs, a well-studied domain for programming by example. Often end users provide these examples and are unwilling to give more than one instance, which leaves the target program highly ambiguous. We model this ambiguity by sampling string edit programs, allowing us to learn from very few examples (Figure 1) and offer different plausible solutions. Our sampler also incorporates a description-length prior to bias us towards simpler programs.

Another program learning domain comes from computer-aided programming, where the goal is to synthesize algorithms from either examples or formal specifications. This problem can be ill posed because many programs may satisfy the specification or examples. When this ambiguity arises, PROGRAMSAMPLE proposes multiple implementations with a bias towards shorter or simpler ones.
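The ambiguity that motivates sampling can be made concrete with a small brute-force sketch (ours, not the paper's system): even a toy stand-in for the string-edit language admits several programs consistent with the single input/output pair of Figure 1, and those programs disagree on fresh inputs. The names const and substr echo Figure 1; everything else here is illustrative.

```python
# Toy illustration of one-shot ambiguity: many string-edit "programs" agree
# with the single example "1/21/2001" -> "01" but differ on other inputs.
# This mini-DSL (const / substr with Python-style indices) is our own
# stand-in, not the DSL used by PROGRAMSAMPLE.

def make_const(s):
    """A program that ignores its input and emits the constant string s."""
    return lambda inp: s

def make_substr(i, j):
    """A program that returns the slice inp[i:j] (j=None means end of string)."""
    return lambda inp: inp[i:j]

def consistent_programs(inp, out):
    """Enumerate all toy programs mapping the training input to its output."""
    progs = [("const(%r)" % out, make_const(out))]
    n = len(inp)
    for i in range(-n, n):
        for j in list(range(-n, n)) + [None]:
            f = make_substr(i, j)
            if f(inp) == out:
                progs.append(("substr(%d,%s)" % (i, j), f))
    return progs

progs = consistent_programs("1/21/2001", "01")
# More than one program fits the example, and they generalize differently.
```

On a held-out input such as "3/4/1985", const('01') still emits "01" while substr(-2,None) emits "85": exactly the ambiguity Figure 1 asks the sampler to represent.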
The samples can also be used to efficiently approximate the posterior predictive distribution, effectively integrating out the program. We show PROGRAMSAMPLE learning routines for counting and recursively sorting/reversing lists while modeling the uncertainty over the correct algorithm.

Figure 1: Learning string manipulation programs by example (top input/output pair). Our system receives data like that shown above and then samples the programs shown below, glossed as "last 0 til end", "output 01", and "take last two".

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Because any model can be represented as a (probabilistic or deterministic) program, we need to carefully delimit the scope of this work. The programs we learn are a subset of those handled by constraint-based program synthesis tools. This means that the program is finite (bounded size, bounded runtime, bounded memory consumption), can be modeled in a constraint solver (like a SAT or SMT solver), and that the program's high-level structure is already given as a sketch [6], which can take the form of a recursive grammar over expressions. The sketch defines the search space and imparts prior knowledge. For example, we use one sketch when learning string edit programs and a different sketch when learning recursive list manipulation programs.

More formally, our sketch specifies a finite set of programs, S, as well as a measure of the programs' description length, which we write as |x| for x ∈ S. This defines a prior (∝ 2^{-|x|}). For each program learning problem, we have a specification (such as consistency with input/output examples) and want to sample from S conditioned upon the specification holding, giving the posterior over programs ∝ 2^{-|x|} 1[specification holds for x].
Throughout the rest of the paper, we write p(·) to mean this posterior distribution, and write X to mean the set of all programs in S consistent with the specification. So the problem is to sample from p(x) = 2^{-|x|}/Z, where Z = Σ_{x ∈ X} 2^{-|x|}.

We can invoke a solver, which enumerates members of X, possibly subject to extra constraints, but without any guarantees on the order of enumeration. Throughout this work we use a SAT solver, and encode x ∈ X in the values of n Boolean decision variables. With a slight abuse of notation we will use x to refer to both a member of X and an assignment to those n decision variables. We write the assignment to the jth variable as x_j for 1 ≤ j ≤ n. Section 1.2 briskly summarizes the constraint-solving program synthesis approach.

1.2 Program synthesis by constraint solving

The constraint solving approach to program synthesis, pioneered in [6, 8], synthesizes programs by (1) modeling the space of programs as assignments to Boolean decision variables in a constraint satisfaction problem; (2) adding constraints to enforce consistency with a specification; (3) asking the solver to find any solution to the constraints; and (4) reinterpreting that solution as a program.

Figure 2 illustrates this approach for the toy problem of synthesizing programs in a language consisting of single-bit operators. Each program has one input (i in Figure 2) which it transforms using nand gates. The grammar in Figure 2a is the sketch. If we inline the grammar, we can diagram the space of all programs as an AND/OR graph (Figure 2b), where the x_j are Boolean decision variables that control the program's structure. For each of the input/output examples (Figure 2d) we have constraints that model the program execution (Figure 2c) and enforce the desired output (P_1 taking value 1). After solving for a satisfying assignment to the x_j's, we can read these off as a program (Figure 2e).
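The single-bit nand language of Figure 2 is small enough to enumerate by brute force, which makes the posterior p(x) = 2^{-|x|}/Z concrete without any solver machinery. The sketch below is our own illustration (charging one bit per grammar choice, with the specification Program(i = 0) = 1), not the paper's SAT pipeline:

```python
from fractions import Fraction

# Brute-force posterior over the toy nand language of Figure 2:
# p(x) = 2^{-|x|} / Z restricted to programs consistent with Program(i=0)=1.

def programs(depth):
    """All programs up to a nesting depth, as (tree, description-length) pairs.
    Each grammar choice (i vs. nand) is charged one bit."""
    if depth == 0:
        return [("i", 1)]
    smaller = programs(depth - 1)
    out = [("i", 1)]
    for p, np_ in smaller:
        for q, nq in smaller:
            out.append((("nand", p, q), 1 + np_ + nq))
    return out

def run(prog, i):
    """Evaluate a program tree on the single-bit input i."""
    if prog == "i":
        return i
    _, p, q = prog
    return 1 - (run(p, i) & run(q, i))  # nand gate

def posterior(depth, i, o):
    """Exact posterior p(x) over consistent programs, as exact fractions."""
    X = [(p, n) for p, n in programs(depth) if run(p, i) == o]
    Z = sum(Fraction(1, 2 ** n) for _, n in X)
    return [(p, Fraction(1, 2 ** n) / Z) for p, n in X]

post = posterior(2, 0, 1)
# The shortest consistent program, nand(i, i) with |x| = 3 bits, dominates.
```

At depth 2 three programs are consistent, and the MAP program nand(i, i) carries posterior mass 2/3; this sharply peaked, long-tailed shape is what the tilt quantity of Section 1.3 measures at scale.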
In this work we measure the description length of a program x as the number of bits required to specify its structure (so |x| is a natural number).^1 PROGRAMSAMPLE further constrains unused bits to take a canonical form, such as all being zero. This causes the mapping between programs x ∈ X and variable assignments {x_j}_{j=1}^{n} to be one-to-one.

1.3 Algorithmic contribution

In the past decade different groups of researchers have concurrently developed solver-based techniques for (1) sampling of combinatorial spaces [9, 7, 10, 11] and (2) program synthesis [6, 8]. This work merges these two lines of research to attack the problem of program learning in a probabilistic setting. We use program synthesis tools to convert a program learning problem into a SAT formula. Then, rather than search for one program (formula solution), we augment the formula with random constraints that cause it to (approximately) sample the space of programs, effectively "upgrading" our SAT solver from a program synthesizer to a program sampler.

The groundbreaking algorithms in [9] gave the first scheme (XORSample) for sampling discrete spaces by adding random constraints to a constraint satisfaction problem. While one could use a tool like Sketch to reduce a program learning problem to SAT and then use an algorithm like XORSample, PAWS, or WeightGen [9, 7, 10] to sample programs from a description-length prior, doing so can be surprisingly inefficient.^2 The efficiency of these sampling algorithms depends critically on a quantity called the distribution's tilt, introduced in [10] as max_x p(x) / min_x p(x). When there are a few very likely (short) programs and many extremely unlikely (long) programs, the posterior over programs becomes extremely tilted. Recent work has relied on upper bounding the tilt, often to around 20 [10]. For program sampling problems, we usually face very high tilt, upwards of 2^{50}. Our main algorithmic contribution is a new approach that extends these techniques to distributions with high tilt, such as those encountered in program induction.

^1 This is equivalent to the assumption that x is drawn from a probabilistic grammar specified by the sketch.

Figure 2: Synthesizing a program via sketching and constraint solving. Typewriter font refers to pieces of programs or sketches, while math font refers to pieces of a constraint satisfaction problem. The variable i is the program input.
(a) Sketch: Program ::= i | nand(Program,Program)
(b) Program space: an AND/OR graph over the choices x_1, x_2, x_3, ..., with program fragments P_1, P_2, P_3, ... (diagram omitted)
(c) Constraints for SAT solver:
i ⇔ 0 ∧ P_1 ⇔ 1
x_1 ⇒ (P_1 ⇔ i)
¬x_1 ⇒ (P_1 ⇔ ¬(P_2 ∧ P_3))
x_2 ⇒ (P_2 ⇔ i)
¬x_2 ⇒ (P_2 ⇔ ¬(P_4 ∧ P_5))
x_3 ⇒ (P_3 ⇔ i)
...
(d) Specification: Program(i = 0) = 1
(e) A constraint solution: x_1 = 0, x_2 = 1, x_3 = 1; Program = nand(i, i); |x| = 3 bits

2 The sampling algorithm

Given the distribution p(·) on the program space X, it is always possible to define a higher-dimensional space E (an embedding) and a mapping F : E → X such that sampling uniformly from E and applying F will give us approximately p-distributed samples [7]. But when the tilt of p(·) becomes large, we found that such an approach is no longer practical.^3

Our approach instead is to define an F′ : E → X such that uniform samples on E map to a distribution q(·) that is guaranteed to have low tilt and whose KL divergence from p(·) is low.
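The tilt depends only on the extreme description lengths, so it explodes with the gap between the shortest and longest consistent programs; a hypothetical space spanning 10-bit to 60-bit programs already has tilt 2^{50}, far beyond the bounds around 20 handled by prior samplers. A two-line check (our illustration, with made-up lengths):

```python
from fractions import Fraction

# Tilt max_x p(x) / min_x p(x) of a description-length posterior
# p(x) proportional to 2^{-|x|}: it equals 2^(longest - shortest length).

def tilt(lengths):
    """Tilt of the posterior, given description lengths of consistent programs."""
    probs = [Fraction(1, 2 ** n) for n in lengths]
    return max(probs) / min(probs)

# Hypothetical description lengths: shortest 10 bits, longest 60 bits.
assert tilt([10, 25, 40, 60]) == 2 ** 50
```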
The discrepancy between the distributions p(·) and q(·) can be corrected through rejection sampling. Sampling uniformly from E is itself not trivial, but a variety of techniques exist to approximate uniform sampling by adding random XOR constraints (random projections mod 2) to the set E, which is extensively studied in [9, 12, 10, 13, 11]. These techniques introduce approximation error that can be made arbitrarily small at the expense of lower efficiency. Figure 3 illustrates this process.

2.1 Getting high-quality samples

Low-tilt approximation. We introduce a parameter into the sampling algorithm, d, that parameterizes q(·). The parameter d acts as a threshold, or cut-off, for the description length of a program; the distribution q(·) acts as though any program with description length exceeding d can be encoded using d bits. Concretely,

q(x) ∝ 2^{-|x|} if |x| ≤ d; 2^{-d} otherwise.   (1)

If we could sample exactly from q(·), we could reject a sample x with probability 1 − A(x), where

A(x) ∝ 1 if |x| ≤ d; 2^{-|x|+d} otherwise,   (2)

and get exact samples from p(·), where the acceptance rate would approach 1 exponentially quickly in d. We have the following result; see supplement for proofs.

Proposition 1. Let x ∈ X be a sample from q(·). The probability of accepting x is at least 1/(1 + |X| 2^{|x*|−d}), where x* = arg min_x |x|.

The distribution q(·) is useful because we can guarantee that it has tilt bounded by 2^{d−|x*|}.

^2 In many cases, slower than rejection sampling or enumerating all of the programs.
^3 [10] take a qualitatively different approach from [7], not based on an embedding, but which still becomes prohibitively expensive in the high-tilt regime.
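Equations (1) and (2) compose into a standard proposal-plus-rejection pair: q(x)·A(x) ∝ 2^{-|x|} for every x, so exact draws from q thinned by A are exact draws from p, while q's tilt stays below 2^{d−|x*|}. A small numeric check with made-up description lengths (our illustration):

```python
from fractions import Fraction

# The low-tilt proposal q (Eq. 1) and acceptance probability A (Eq. 2).

def q_weight(n, d):
    """Unnormalized q for a program of description length n (Eq. 1)."""
    return Fraction(1, 2 ** min(n, d))

def accept_prob(n, d):
    """Acceptance probability A for a program of length n (Eq. 2)."""
    return Fraction(1) if n <= d else Fraction(1, 2 ** (n - d))

lengths = [3, 5, 9, 14]   # made-up description lengths of consistent programs
d = 6

# q(x) * A(x) equals the p-weight 2^{-|x|} up to one global constant:
ratios = {q_weight(n, d) * accept_prob(n, d) / Fraction(1, 2 ** n)
          for n in lengths}
assert len(ratios) == 1

# and the tilt of q is bounded by 2^{d - |x*|}:
weights = [q_weight(n, d) for n in lengths]
assert max(weights) / min(weights) <= 2 ** (d - min(lengths))
```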
Introducing the proposal q(·) effectively reifies the tilt, making it a parameter of the sampling algorithm, not of the distribution over programs. We now show how to approximately sample from q(·) using a variant of the Embed and Project framework [7].

The embedding. The idea is to define a new set of programs, which we call E, such that short programs are included in the set much more often than long programs. Each program x will be represented in E by an amount proportional to 2^{−min(|x|,d)}, thus proportional to q(x), such that sampling elements uniformly from E samples according to q(·).

Figure 3: PROGRAMSAMPLE twice distorts the posterior distribution p(·). First it introduces a parameter d that bounds the tilt; we correct for this by accepting samples w.p. A(x). Second, it samples from q(·) by drawing instead from r(·), where KL(q||r) can be made arbitrarily small by appropriately setting another parameter, K. The distribution of samples is A(x)r(x).

We embed X within the larger set E by introducing d auxiliary variables, written (y_1, ..., y_d), such that every element of E is a tuple of an element x = (x_1, ..., x_n) of X and an assignment to y = (y_1, ..., y_d):

E = {(x, y) : x ∈ X, ∧_{1≤j≤d} (|x| ≥ j ⇒ y_j = 1)}   (3)

Suppose we sample (x, y) uniformly from E. Then the probability of getting a particular x ∈ X is proportional to |{(x′, y) ∈ E : x′ = x}| = |{y : |x| ≥ j ⇒ y_j = 1}| = 2^{max(0, d−|x|)}, which is proportional to q(x).
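The whole pipeline (embed via Eq. (3), project with random parity constraints, correct with A(x)) can be simulated end to end on a toy space, with brute-force enumeration standing in for the SAT solver and a one-hot encoding of the elements of E standing in for the real Boolean variables. Every name below is our own illustrative stand-in:

```python
import random
from collections import Counter

# Miniature embed-and-project sampler. Programs are abstracted to
# (name, description-length) pairs; d is the tilt-bounding threshold.

d = 4
space = {"short": 2, "mid": 3, "long": 6}   # |x| for three toy programs

# Embedding (Eq. 3): pair each x with every y in {0,1}^d whose first
# min(|x|, d) bits are forced to 1. Each x then appears 2^max(0, d-|x|)
# times in E, i.e. proportionally to q(x).
E = []
for name, n in space.items():
    for bits in range(2 ** d):
        y = tuple((bits >> j) & 1 for j in range(d))
        if all(y[j] == 1 for j in range(min(n, d))):
            E.append((name, y))

multiplicity = Counter(name for name, _ in E)
assert multiplicity["short"] == 4 and multiplicity["long"] == 1

def sample(K, rng):
    """One program sample: K random parity constraints over a one-hot
    encoding of E (a toy stand-in for constraints on the SAT variables),
    then a uniform choice among survivors, then the A(x) correction."""
    index = {e: i for i, e in enumerate(E)}
    while True:
        h = [rng.getrandbits(len(E)) for _ in range(K)]  # random projections
        b = [rng.getrandbits(1) for _ in range(K)]
        survivors = [e for e in E
                     if all(bin(h[k] & (1 << index[e])).count("1") % 2 == b[k]
                            for k in range(K))]
        if not survivors:
            continue                                     # rejected attempt
        name, _ = rng.choice(survivors)
        if rng.random() < 2.0 ** (d - space[name]):      # accept w.p. A(x)
            return name

rng = random.Random(0)
draw = sample(K=2, rng=rng)
assert draw in space
```

Under the one-hot encoding each parity constraint degenerates to a coin flip per element, which keeps the toy faithful to the two properties the text asks of the projections: each element survives with probability 1/2, independently across distinct elements.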
Notice that |E| grows exponentially with d, and thus with the tilt of q(·). This is the crux of the inefficiency of sampling from high-tilt distributions in these frameworks: the auxiliary variables combine with the random constraints to entangle otherwise independent Boolean decision variables, while also increasing the number of variables and clauses.

The random projections. We could sample exactly from E by invoking the solver |E| + 1 times to get every element of E, but in general it will have O(|X| 2^d) elements, which could be very large. Instead, we ask the solver for all the elements of E consistent with K random constraints such that (1) few elements of E are likely to satisfy ("survive") the constraints, and (2) any element of E is approximately equally likely to satisfy the constraints. We can then sample a survivor uniformly to get an approximate sample from E, an idea introduced in the XORSample′ algorithm [9]. Although simple compared to recent approaches [10, 14, 15], it suffices for our theoretical and empirical results.

Our random constraints take the form of XOR, or parity, constraints, which are random projections mod 2. Each constraint fixes the parity of a random subset of SAT variables in x to either 1 or 0; thus any x survives a constraint with probability 1/2. A useful feature of random parity constraints is that whether an assignment to the SAT variables survives is independent of whether another, different assignment survives, which has been exploited to create a variety of approximate sampling algorithms [9, 12, 10, 13, 11]. Then the K constraints are of the form h (x; y) ≡ b (mod 2), where h is a K × (d + n) binary matrix and b is a K-dimensional binary vector. If no solutions satisfy the K constraints then the sampling attempt is rejected. These samples are close to uniform in the following sense:

Proposition 2. The probability of sampling (x, y) is at least (1/|E|) × 1/(1 + 2^K/|E|), and the probability of getting any sample at all is at least 1 − 2^K/|E|.

So we get approximate samples from E as long as |E| 2^{−K} is not small. In reference to Figure 3, we call the distribution of these samples r(x) = Σ_y r(x, y). Schemes more sophisticated than XORSample′, like [7], also guarantee upper bounds on sampling probability, but we found that these were unnecessary for our main result, which is that the KL between p(·) and A(x)r(x) goes to zero exponentially quickly in a new quantity we call Δ:

Proposition 3. Write Ar(x) to mean the distribution proportional to A(x)r(x). Then D(p||Ar) < log(1 + (1 + 2^{−γ})/(1 + 2^{Δ})), where Δ = log |E| − K and γ = d − log |X| − |x*|.

So we can approximate the true distribution p(·) arbitrarily well, but at the expense of either more calls to the solver (increasing Δ) or a larger embedding (increasing γ; our main algorithmic contribution). See supplement for theoretical and empirical analyses of this accuracy/runtime trade-off.

Proposition 3 requires knowing min_x |x| to set K and d. We compute min_x |x| using the iterative minimization routine in [16]; in practice this is very efficient for finite program spaces. We also need to calculate |X| and |E|, which are model counts that are in general difficult to compute exactly. However, many approximate model counting schemes exist, which provide upper and lower bounds that hold with arbitrarily high probability. We use Hybrid-MBound [13] to upper bound |X| and to lower bound |E|; each bound individually holds with probability at least 1 − δ/2, giving lower bounds on both the γ and Δ parameters of Proposition 3 with probability at least 1 − δ, and thus an upper bound on the KL divergence. Algorithm 1 puts these ideas together.

Algorithm 1 PROGRAMSAMPLE
Input: Program space X, number of samples N, failure probability δ, parameters Δ > 0, γ > 0
Output: N samples
Set |x*| = min_{x∈X} |x|
Set B_X = ApproximateUpperBoundModelCount(X, δ/2)
Set d = ⌈γ + log B_X + |x*|⌉
Define E = {(x, y) : x ∈ X, ∧_{1≤j≤d} (|x| ≥ j ⇒ y_j = 1)}
Set B_E = ApproximateLowerBoundModelCount(E, δ/2)
Set K = ⌊log B_E − Δ⌋
Initialize samples = []
repeat
  Sample h uniformly from {0, 1}^{K×(d+n)}
  Sample b uniformly from {0, 1}^K
  Enumerate S = {(x, y) : h (x; y) ≡ b (mod 2) ∧ x ∈ X}
  if |S| > 0 then
    Sample (x, y) uniformly from S
    if Uniform(0, 1) < 2^{d−|x|} then
      samples = samples + [x]
    end if
  end if
until |samples| = N
return samples

3 Experimental results

We evaluated PROGRAMSAMPLE on program learning problems in a text editing domain and a list manipulation domain. For each domain, we wrote down a sketch and produced SAT formulas using the tool in [6], specifying a large but finite set of possible programs. This implicitly defined a description-length prior, where |x| is the number of bits required to specify x in the SAT encoding. We used CryptoMiniSAT [17], which can efficiently handle parity constraints.

3.1 Learning Text Edit Scripts

We applied our program sampling algorithm to a suite of programming-by-demonstration problems within a text editing domain. Here, the challenge is to learn a small text editing program from very few examples and apply that program to held-out inputs.
This problem is timely, given the widespread use of the FlashFill program synthesis tool, which now ships by default in Microsoft Excel [18] and can learn sophisticated edit operations in real time from examples. We modeled a subset of the FlashFill language; our goal here is not to compete with FlashFill, which is cleverly engineered for its specific domain, but to study the behavior of our more general-purpose program learner in a real-world task. To impart domain knowledge, we used a sketch equivalent to Figure 4.

Figure 4 (sketch grammar):
Program ::= Term | Program + Term
Term ::= String | substr(Pos,Pos)
Pos ::= Number | pos(String,String,Number)
Number ::= 0 | 1 | 2 | ... | -1 | -2 | ...
String ::= Character | Character + String
Character ::= a | b | c | ...

Because FlashFill's training set is not yet public, we drew text editing problems from [19] and adapted them to our subset of FlashFill, giving 19 problems, each with 5 training examples. The supplement contains these text edit problems.

We are interested both in the ability of the learner to generalize and in PROGRAMSAMPLE's ability to generate samples quickly. Table 1 shows the average time per sampling attempt using PROGRAMSAMPLE, which is on the order of a minute. These text edit problems come from distributions with extremely high tilt: often the smallest program is only tens of bits long, but the program space contains (implausible) solutions with over 100 bits. By setting d to |x*| + n we eliminate the tilt correction and recover a variant of the approaches in [7]. This baseline does not produce any samples for any of our text edit problems in under an hour.^4 Other baselines also failed to produce samples in a reasonable amount of time (see supplement).
For example, pure rejection sampling (drawing from the prior) is also infeasible, with consistent programs having prior probability ≤ 2^{−50} in some cases.

The learner generalizes to unseen examples, as Figure 5 shows. We evaluated the performance of the learner on held-out test examples while varying training set size, and compare with baselines that either (1) enumerate programs in the arbitrary order provided by the underlying solver, or (2) take the most likely program under p(x) (MDL learner). The posterior is sharply peaked, with most samples being from the MAP solution, and so our learner does about as well as the MDL learner. However, sampling offers an (approximate) predictive posterior over predictions on the held-out examples; in a real-world scenario, one would offer the top C predictions to the user and let them choose, much like how spelling correction works. This procedure allows us to offer the correct predictions more often than the MDL learner (Figure 6), because we correctly handle ambiguous problems like the one in Figure 1. We see this as a primary strength of the sampling approach to Bayesian program learning: when learning from one or a few examples, a point estimate of the posterior can often miss the mark.

Figure 4: The sketch (program space) for learning text edit scripts.

Figure 5: Generalization when learning text edit operations by example. Results averaged across 19 problems. Solid: 100 samples from PROGRAMSAMPLE. Dashed: enumerating 100 programs. Dotted: MDL learner. Test cases past 1 (respectively 2, 3) examples are held out when trained on 1 (respectively 2, 3) examples.

Figure 6: Comparing the MDL learner (dashed black line) to program sampling when doing one-shot learning. We count a problem as "solved" if the correct joint prediction for the test cases is in the top C most frequent samples.

^4 Approximate model counting of E was also intractable in this regime, so we used the lower bound |E| ≥ 2^{d−|x*|} + |X| − 1.

Table 1: Average solver time to generate a sample, measured in seconds. See Figures 9 and 5 for training set sizes. n ≈ 180, 65 for the text edit and list manipulation domains, respectively. Without tilt correction, sampling text edit & count takes > 1 hour.

            Small set   Medium set   Large set
text edit   84 ±3       21 ±1        49 ±3
sort        463 ±65     905 ±58      1549 ±155
reverse     39 ±3       141 ±18      326 ±42
count       ≤ 1         ≤ 1         ≤ 1

Figure 7: Sampling frequency vs. ground truth probability on a counting task with Δ = 3 and γ = 4.

3.2 Learning list manipulation algorithms

Figure 8 (sketch grammar):
Program ::= (if Bool List (append RecursiveList RecursiveList RecursiveList))
RecursiveList ::= List | (recurse List)
List ::= nil | X | (tail List) | (filter Bool List) | (list Int)
Bool ::= (<= Int) | (>= Int)
Int ::= 0 | (1+ Int) | (1- Int) | (length List) | (head List)

One goal of program synthesis is computer-aided programming [6], which is the automatic generation of executable code from either declarative specifications or examples of desired behavior. Systems with this goal have been successfully applied to, for example, synthesizing intricate bitvector routines from specifications [18]. However, when learning from examples, there is often uncertainty over the correct program.
While past approaches have handled this uncertainty within an optimization framework (see [20, 21, 16]), we show that PROGRAMSAMPLE can sample algorithms.

We take as our goal to learn recursive routines for sorting, reversing, and counting list elements from input/output examples, particularly in the ambiguous, unconstrained regime of few examples. We used a sketch with a set of basis primitives capable of representing a range of list manipulation routines, equivalent to Figure 8.

A description-length prior that penalizes longer programs allowed learning of recursive list manipulation routines (from production Program) and a non-recursive count routine (from production Int); see Figure 9, which shows average accuracy on held-out test data when trained on variable numbers of short randomly generated lists. With the large training set (5–11 examples) PROGRAMSAMPLE recovers a correct implementation, and with less data it recovers a distribution over programs that functions as a probabilistic algorithm despite being composed of only deterministic programs.

For some of these tasks the number of consistent programs is small enough that we can enumerate all of them, allowing us to compare our sampler with ground-truth probabilities. Figure 7 shows this comparison for a counting problem with 80 consistent programs, showing empirically that the tilt correction and random constraints do not significantly perturb the distribution.

Table 1 shows the average solver time per sample. Generating recursive routines like sorting and reversing is much more costly than generating the nonrecursive counting routine: the constraint-based approach propositionalizes higher-order constructs like recursion, and so reasoning about them is expensive. Yet counting problems are highly tilted due to count's short implementation, which makes them intractable without our tilt correction.

Figure 8: The sketch (program space) for learning list manipulation routines; X is the program input.

Figure 9: Learning to manipulate lists. Trained on lists of length ≤ 3; tested on lists of length ≤ 14.

4 Discussion

4.1 Related work

There is a vast literature on program learning in the AI and machine learning communities. Many approaches employ a (possibly stochastic) heuristic search over structures using genetic programming [22] or MCMC [23]. These approaches often find good programs and can discover more high-level structure than our approach. However, they are prone to getting trapped in local minima and, when used as samplers, lack theoretical guarantees. Other work has addressed learning priors over programs in a multitask setting [4, 5]. We see our work as particularly complementary to these methods: while they focus on learning the structure of the hypothesis space, we focus on efficiently sampling an already given hypothesis space (the sketch). Several recent proposals for recurrent deep networks can learn algorithms [2, 1]. We see our system working in a different regime, where we want to quickly learn an algorithm from a small number of examples or an ambiguous specification.

The program synthesis community has recently proposed several learners that work in an optimization framework [20, 21, 16].
By computing a posterior over programs, we can more effectively represent uncertainty, particularly in the small-data limit, but at the cost of more computation.

PROGRAMSAMPLE borrows heavily from a line of work started in [9, 13] on sampling combinatorial spaces using random XOR constraints. An exciting new approach is to use sparse XOR constraints [14, 15], which might sample more efficiently from our embedding of the program space.

4.2 Limitations of the approach

Constraint-based synthesis methods tend to excel in domains where the program structure is restricted by a sketch [6] and where much of the program's description length can be easily computed from the program text. For example, PROGRAMSAMPLE can synthesize text editing programs that are almost 60 bits long in a couple of seconds, but spends 10 minutes synthesizing a recursive sorting routine that is shorter but whose structure is less restricted. Constraint-based methods also require the entire problem to be represented symbolically, so they have trouble when the function to be synthesized involves difficult-to-analyze building blocks such as numerical routines. For such problems, stochastic search methods [23, 22] can be more effective because they only need to run the functions under consideration. Finally, past work shows empirically that these methods scale poorly with data set size, although this can be mitigated by considering data incrementally [21, 20].

The requirement of producing representative samples imposes additional overhead on our approach, so scalability can be more limited than for standard symbolic techniques on some problems. For example, our method requires one MAP inference query and two queries to an approximate model counter. These serve to "calibrate" the sampler, and their cost can be amortized because they only have to be invoked once in order to generate an arbitrary number of i.i.d. samples.
Approximate model counters like MBound [13] have complexity comparable with that of generating a sample, but the complexity can depend on the number of solutions. Thus, for good performance, PROGRAMSAMPLE requires that there not be too many programs consistent with the data: the largest spaces considered in our experiments had ≤ 10^7 programs. This limitation, together with the general performance characteristics of symbolic techniques, means that the approach will work best for "needle in a haystack" problems, where the space of possible programs is large but restricted in its structure, and where only a small fraction of the programs satisfy the constraints.

4.3 Future work

This work could naturally extend to other domains that involve inducing latent symbolic structure from small amounts of data, such as semantic parsing to logical forms [24], synthesizing motor programs [3], or learning relational theories [25]. These applications have some component of transfer learning, and building efficient program learners that can transfer inductive biases across tasks is a prime target for future research.

Acknowledgments

We are grateful for feedback from Adam Smith, Kuldeep Meel, and our anonymous reviewers. Work supported by NSF-1161775 and AFOSR award FA9550-16-1-0012.

References

[1] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.

[2] Scott Reed and Nando de Freitas. Neural programmer-interpreters. CoRR, abs/1511.06279, 2015.

[3] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[4] Percy Liang, Michael I. Jordan, and Dan Klein. Learning programs: A hierarchical Bayesian approach. In Johannes Fürnkranz and Thorsten Joachims, editors, ICML, pages 639–646.
Omnipress, 2010.

[5] Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Kalai. A machine learning framework for programming by example. In ICML, pages 187–195, 2013.

[6] Armando Solar-Lezama. Program Synthesis by Sketching. PhD thesis, EECS Department, University of California, Berkeley, Dec 2008.

[7] Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, and Bart Selman. Embed and project: Discrete sampling with universal hashing. In Advances in Neural Information Processing Systems, pages 2085–2093, 2013.

[8] Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component-based program synthesis. In ICSE, volume 1, pages 215–224. IEEE, 2010.

[9] Carla P. Gomes, Ashish Sabharwal, and Bart Selman. Near-uniform sampling of combinatorial spaces using XOR constraints. In Advances in Neural Information Processing Systems, pages 481–488, 2006.

[10] Supratik Chakraborty, Daniel Fremont, Kuldeep Meel, Sanjit Seshia, and Moshe Vardi. Distribution-aware sampling and weighted model counting for SAT. In AAAI Conference on Artificial Intelligence, 2014.

[11] Supratik Chakraborty, Kuldeep S. Meel, and Moshe Y. Vardi. A scalable and nearly uniform generator of SAT witnesses. In International Conference on Computer Aided Verification, pages 608–623. Springer, 2013.

[12] Leslie G. Valiant and Vijay V. Vazirani. NP is as easy as detecting unique solutions. In Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, pages 458–463. ACM, 1985.

[13] Carla P. Gomes, Ashish Sabharwal, and Bart Selman. Model counting: A new strategy for obtaining good bounds. In AAAI Conference on Artificial Intelligence, 2006.

[14] Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Low-density parity constraints for hashing-based discrete integration. In ICML, pages 271–279, 2014.

[15] Dimitris Achlioptas and Pei Jiang.
Stochastic integration via error-correcting codes. UAI, 2015.

[16] Rishabh Singh, Sumit Gulwani, and Armando Solar-Lezama. Automated feedback generation for introductory programming assignments. In ACM SIGPLAN Notices, volume 48, pages 15–26. ACM, 2013.

[17] CryptoMiniSat. http://www.msoos.org/documentation/cryptominisat/.

[18] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, volume 46, pages 317–330. ACM, 2011.

[19] Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua B. Tenenbaum, and Stephen Muggleton. Bias reformulation for one-shot function induction. In ECAI 2014, pages 525–530, 2014.

[20] Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. Learning programs from noisy data. In POPL, pages 761–774. ACM, 2016.

[21] Kevin Ellis, Armando Solar-Lezama, and Josh Tenenbaum. Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems, pages 973–981, 2015.

[22] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Complex Adaptive Systems. MIT Press, 1993.

[23] Eric Schkufza, Rahul Sharma, and Alex Aiken. Stochastic superoptimization. In ACM SIGARCH Computer Architecture News, volume 41, pages 305–316. ACM, 2013.

[24] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599, 2011.

[25] Yarden Katz, Noah D. Goodman, Kristian Kersting, Charles Kemp, and Joshua B. Tenenbaum. Modeling semantic cognition as logical dimensionality reduction.
In CogSci, pages 71–76, 2008.