{"title": "HOUDINI: Lifelong Learning as Program Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 8687, "page_last": 8698, "abstract": "We present a neurosymbolic framework for the lifelong learning of algorithmic tasks that mix perception and procedural reasoning. Reusing high-level concepts across domains and learning complex procedures are key challenges in lifelong learning. We show that a program synthesis approach that combines gradient descent with combinatorial search over programs can be a more effective response to these challenges than purely neural methods. Our framework, called HOUDINI, represents neural networks as strongly typed, differentiable functional programs that use symbolic higher-order combinators to compose a library of neural functions. Our learning algorithm consists of: (1) a symbolic program synthesizer that performs a type-directed search over parameterized programs, and decides on the library functions to reuse, and the architectures to combine them, while learning a sequence of tasks; and (2) a neural module that trains these programs using stochastic gradient descent. We evaluate HOUDINI on three benchmarks that combine perception with the algorithmic tasks of counting, summing, and shortest-path computation. 
Our experiments show that HOUDINI transfers high-level concepts more effectively than traditional transfer learning and progressive neural networks, and that the typed representation of networks significantly accelerates the search.", "full_text": "HOUDINI: Lifelong Learning as Program Synthesis\n\nLazar Valkov\n\nUniversity of Edinburgh\nL.Valkov@sms.ed.ac.uk\n\nDipak Chaudhari\nRice University\ndipakc@rice.edu\n\nAkash Srivastava\n\nUniversity of Edinburgh\n\nAkash.Srivastava@ed.ac.uk\n\nCharles Sutton\n\nUniversity of Edinburgh,\n\nThe Alan Turing Institute, and Google Brain\n\ncharlessutton@google.com\n\nSwarat Chaudhuri\n\nRice University\nswarat@rice.edu\n\nAbstract\n\nWe present a neurosymbolic framework for the lifelong learning of algorithmic\ntasks that mix perception and procedural reasoning. Reusing high-level concepts\nacross domains and learning complex procedures are key challenges in lifelong\nlearning. We show that a program synthesis approach that combines gradient\ndescent with combinatorial search over programs can be a more effective response\nto these challenges than purely neural methods. Our framework, called HOUDINI,\nrepresents neural networks as strongly typed, differentiable functional programs\nthat use symbolic higher-order combinators to compose a library of neural functions. Our learning algorithm consists of: (1) a symbolic program synthesizer that\nperforms a type-directed search over parameterized programs, and decides on the\nlibrary functions to reuse, and the architectures to combine them, while learning a\nsequence of tasks; and (2) a neural module that trains these programs using stochastic gradient descent. We evaluate HOUDINI on three benchmarks that combine\nperception with the algorithmic tasks of counting, summing, and shortest-path\ncomputation. 
Our experiments show that HOUDINI transfers high-level concepts\nmore effectively than traditional transfer learning and progressive neural networks,\nand that the typed representation of networks significantly accelerates the search.\n\n1 Introduction\n\nDifferentiable programming languages [25, 29, 8, 15, 10, 39, 34] have recently emerged as a powerful\napproach to the task of engineering deep learning systems. These languages are syntactically similar\nto, and often direct extensions of, traditional programming languages. However, programs in these\nlanguages are differentiable in their parameters, permitting gradient-based parameter learning. The\nframework of differentiable languages has proven especially powerful for representing architectures\nthat have input-dependent structure, such as deep networks over trees [35, 2] and graphs [23, 19].\nA recent paper by Gaunt et al. [14] points to another key appeal of high-level differentiable languages:\nthey facilitate transfer across learning tasks. The paper gives a language called NEURAL TERPRET\n(NTPT) in which programs can invoke a library of trainable neural networks as subroutines. The\nparameters of these \u201clibrary functions\u201d are learned along with the code that calls them. The modularity\nof the language allows knowledge transfer, as a library function trained on a task can be reused in\nlater tasks. In contrast to standard methods for transfer learning in deep networks, which re-use the\nfirst few layers of the network, neural libraries have the potential to enable reuse of higher, more\nabstract levels of the network, in what could be called high-level transfer. In spite of its promise,\nthough, inferring the structure of differentiable programs is a fundamentally hard problem. 
\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWhile NTPT and its predecessor TERPRET [15] allow some aspects of the program structure to be induced,\na detailed hand-written template of the program is required for even the simplest tasks.\nIn this paper, we show that algorithmic ideas from program synthesis can help overcome this difficulty.\nThe goal in program synthesis [3, 36, 13] is to discover programs (represented as terms following a\nspecified syntax) that accomplish a given task. Many symbolic algorithms for the problem have been\nproposed in the recent past [16]. These algorithms can often outperform purely neural approaches on\nprocedural tasks [15]. A key insight behind our approach is that these methods naturally complement\nstochastic gradient descent (SGD) in the induction of differentiable programs: while SGD is an\neffective way of learning program parameters, each step in a symbolic search can explore large\nchanges to the program structure.\nA second feature of our work is a representation of programs in a typed functional language. Such\na representation is natural because functional combinators can compactly describe many common\nneural architectures [26]. For example, fold combinators can describe recurrent neural networks,\nand convolution over data structures such as lists and graphs can also be naturally expressed as\nfunctional combinators. Such representations also facilitate search, as they tend to be more canonical,\nand as the type system can help prune infeasible programs early on in the search [13, 27].\nConcretely, we present a neurosymbolic learning framework, called HOUDINI, which is to our\nknowledge the first method to use symbolic methods to induce differentiable programs. In HOUDINI,\na program synthesizer is used to search over networks described as strongly typed functional programs,\nwhose parameters are then tuned end-to-end using gradient descent. 
Programs in HOUDINI specify\nthe architecture of the network, by using functional combinators to express the network\u2019s connections,\nand can also facilitate learning transfer, by letting the synthesizer choose among library functions.\nHOUDINI is flexible about how the program synthesizer is implemented: here, we use and compare\ntwo implementations, one performing top-down, type-directed enumeration with a preference for\nsimpler programs, and the other based on a type-directed evolutionary algorithm. The implementation\nfor HOUDINI is available online [1].\nWe evaluate HOUDINI in the setting of lifelong learning [38], in which a model is trained on a\nseries of tasks, and each training round is expected to benefit from previous rounds of learning. Two\nchallenges in lifelong learning are catastrophic forgetting, in which later tasks overwrite what has\nbeen learned from earlier tasks, and negative transfer, in which attempting to use information from\nearlier tasks hurts performance on later tasks. Our use of a neural library avoids both problems, as\nthe library allows us to freeze and selectively re-use portions of networks that have been successful.\nOur experimental benchmarks combine perception with algorithmic tasks such as counting, summing,\nand shortest-path computation. Our programming language allows parsimonious representation for\nsuch tasks, as the combinators used to describe network structure can also be used to compactly\nexpress rich algorithmic operations. Our experiments show that HOUDINI can learn nontrivial\nprograms for these tasks. For example, on a task of computing least-cost paths in a grid of images,\nHOUDINI discovers an algorithm that has the structure of the Bellman-Ford shortest path algorithm [7],\nbut uses a learned neural function that approximates the algorithm\u2019s \u201crelaxation\u201d step. 
Second, our\nresults indicate that the modular representation used in HOUDINI allows it to transfer high-level\nconcepts and avoid negative transfer. We demonstrate that HOUDINI offers greater transfer than\nprogressive neural networks [32] and traditional \u201clow-level\u201d transfer [40], in which early network\nlayers are inherited from previous tasks. Third, we show that the use of a higher-level, typed language\nis critical to scaling the search for programs.\nThe contributions of this paper are threefold. First, we propose the use of symbolic program synthesis\nin transfer and lifelong learning. Second, we introduce a specific representation of neural networks\nas typed functional programs, whose types contain rich information such as tensor dimensions, and\nshow how to leverage this representation in program synthesis. Third, we show that the modularity\ninherent in typed functional programming allows the method to transfer high-level modules, to avoid\nnegative transfer and to achieve high performance with a small number of examples.\n\nRelated Work. HOUDINI builds on a known insight from program synthesis [16]: that functional\nrepresentations and type-based pruning can be used to accelerate search over programs [13, 27, 20].\nHowever, most prior efforts on program synthesis are purely symbolic and driven by Boolean\ngoals. HOUDINI repurposes these methods for an optimization setting, coupling them with gradient-based learning. A few recent approaches to program synthesis do combine symbolic and neural\nmethods [11, 6, 12, 28, 18]. Unlike our work, these methods do not synthesize differentiable programs.\nThe only exception is NTPT [14], which neither considers a functional language nor a neurosymbolic\nsearch. Another recent method that creates a neural library is progress-and-compress [33], but this\nmethod is restricted to feedforward networks and low-level transfer. It is based on progressive\nnetworks [32], a method for lifelong learning based on low-level transfer, in which lateral connections\nare added between the networks for all the previous tasks and the new task.\n\nTypes \u03c4:\nAtom ::= bool | real\nTT ::= Tensor\u27e8Atom\u27e9[m1][m2] . . . [mk]\nADT ::= TT | \u03b1\u27e8TT\u27e9\nF ::= ADT | F1 \u2192 F2\n\u03c4 ::= Atom | ADT | F\nPrograms e over library L:\ne ::= \u2295w : \u03c40 | e0 \u25e6 e1 | map\u03b1 e | fold\u03b1 e | conv\u03b1 e\nFigure 1: Syntax of the HOUDINI language. Here, \u03b1 is an ADT, e.g., list; m1, . . . , mk \u2265 1 define\nthe shape of a tensor; F1 \u2192 F2 is a function type; \u2295w \u2208 L is a neural library function that has type\n\u03c40 and parameters w; and \u25e6 is the composition operator. map, fold, and conv are defined below.\n\nNeural module networks (NMNs) [4, 17] select and arrange reusable neural modules into a program-like network, for visual question answering. The major difference from our work is that NMNs\nrequire a natural language input to guide decisions about which modules to combine. HOUDINI\nworks without this additional supervision. Also, HOUDINI can be seen to perform a form of neural\narchitecture search. Such search has been studied extensively, using approaches such as reinforcement\nlearning, evolutionary computation, and best-first search [42, 24, 31, 41]. HOUDINI operates at a\nhigher level of abstraction, combining entire networks that have been trained previously, rather than\noptimizing over lower-level decisions such as the width of convolutional filters, the details of the\ngating mechanism, and so on. 
HOUDINI is distinct in its use of functional programming to represent\narchitectures compactly and abstractly, and in its extensive use of types in accelerating search.\n\n2 The HOUDINI Programming Language\n\nHOUDINI consists of two components. The first is a typed, higher-order, functional language of\ndifferentiable programs. The second is a learning procedure split into a symbolic module and a\nneural module. Given a task specified by a set of training examples, the symbolic module enumerates\nparameterized programs in the HOUDINI language. The neural module uses gradient descent to find\noptimal parameters for synthesized programs; it also assesses the quality of solutions and decides\nwhether an adequately performant solution has been discovered.\nThe design of the language is based on three ideas:\n\u2022 The ubiquitous use of function composition to glue together different networks.\n\u2022 The use of higher-order combinators such as map and fold to uniformly represent neural architectures as well as patterns of recursion in procedural tasks.\n\u2022 The use of a strong type discipline to distinguish between neural computations over different forms\nof data, and to avoid generating provably incorrect programs during symbolic exploration.\n\nFigure 1 shows the grammar for the HOUDINI language. Here, \u03c4 denotes types and e denotes\nprograms. Now we elaborate on the various language constructs.\n\nTypes. The \u201catomic\u201d data types in HOUDINI are booleans (bool) and reals. For us, bool is\nrelaxed into a real value in [0, 1], which, for example, allows the type system to track if a vector has\nbeen passed through a sigmoid. Tensors over these types are also permitted. We have a distinct type\nTensor\u27e8Atom\u27e9[m1][m2] . . . [mk] for tensors of shape m1 \u00d7 \u00b7\u00b7\u00b7 \u00d7 mk whose elements have atomic\ntype Atom. (The dimensions m1, . . . 
, mk, as well as k itself, are bounded to keep the set of types\nfinite.) We also have function types F1 \u2192 F2, and abstract data types (ADTs) \u03b1\u27e8TT\u27e9 parameterized\nby a tensor type TT. Our current implementation supports two kinds of ADTs: list\u27e8TT\u27e9, lists with\nelements of tensor type TT, and graph\u27e8TT\u27e9, graphs whose nodes have values of tensor type TT.\nPrograms. The fundamental operation in HOUDINI is function composition. A composition operation can involve functions \u2295w, parameterized by weights w and implemented by neural networks,\ndrawn from a library L. It can also involve a set of symbolic higher-order combinators that are\nguaranteed to preserve end-to-end differentiability and used to implement high-level network architectures. Specifically, we allow the following three families of combinators. The first two are standard\nconstructs for functional languages, whereas the third is introduced specifically for deep models.\n\u2022 Map combinators map\u03b1\u27e8\u03c4\u27e9, for ADTs of the form \u03b1\u27e8\u03c4\u27e9. Suppose e is a function. The expression\nmaplist\u27e8\u03c4\u27e9 e is a function that, given a list [a1, . . . , ak], returns the list [e(a1), . . . , e(ak)]. The\nexpression mapgraph\u27e8\u03c4\u27e9 e is a function that, given a graph G whose i-th node is labeled with a value\nai, returns a graph that is identical to G, but whose i-th node is labeled by e(ai).\n\u2022 Left-fold combinators fold\u03b1\u27e8\u03c4\u27e9. For a function e and a term z, foldlist\u27e8\u03c4\u27e9 e z is the function that,\ngiven a list [a1, . . . , ak], returns the value (e (e . . . (e (e z a1) a2) . . . ) ak). To define fold over\na graph, we assume a linear order on the graph\u2019s nodes. Given G, the function foldgraph\u27e8\u03c4\u27e9 e z\nreturns the fold over the list [a1, . . . 
, ak], where ai is the value at the i-th node in this order.\n\u2022 Convolution combinators conv\u03b1\u27e8\u03c4\u27e9. Let p > 0 be a fixed constant. For a \u201ckernel\u201d function e,\nconvlist\u27e8\u03c4\u27e9 e is the function that, given a list [a1, . . . , ak], returns the list [a\u20321, . . . , a\u2032k], where\na\u2032i = e [ai\u2212p, . . . , ai, . . . , ai+p]. (We define aj = a1 if j < 1, and aj = ak if j > k.) Given\na graph G, the function convgraph\u27e8\u03c4\u27e9 e returns the graph G\u2032 whose node u contains the value\ne [ai1, . . . , aim], where aij is the value stored in the j-th neighbor of u.\n\nEvery neural library function is assumed to be annotated with a type. Using programming language\ntechniques [30], HOUDINI assigns a type to each program e whose subexpressions use types consistently (see supplementary material). If it is impossible to assign a type to e, then e is type-inconsistent.\nNote that complete HOUDINI programs do not have explicit variable names. Thus, HOUDINI follows\nthe point-free style of functional programming [5]. This style permits highly succinct representations\nof complex computations, which reduces the amount of enumeration needed during synthesis.\n\nHOUDINI for deep learning. The language has several properties\nthat are useful for specifying deep models. First, any complete HOUDINI program e is differentiable in the parameters w of the neural\nlibrary functions used in e. Second, common deep architectures can be\ncompactly represented in our language. For example, deep feedforward\nnetworks can be represented by \u22951 \u25e6 \u00b7\u00b7\u00b7 \u25e6 \u2295k, where each \u2295i is a neural function, and recurrent nets can be expressed as foldlist\u27e8\u03c4\u27e9 \u2295 z,\nwhere \u2295 is a neural function and z is the initial state. Graph convolutional networks can be expressed as convgraph\u27e8\u03c4\u27e9 \u2295. 
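To make the semantics concrete, the three combinator families can be sketched as ordinary Python functions specialized to lists. This is only an illustrative sketch, not the HOUDINI implementation: in HOUDINI the argument e is a differentiable neural module, whereas here it is a plain function, and the names map_l, fold_l and conv_l are our own.

```python
# Illustrative sketch of HOUDINI's list combinators, with plain Python
# functions standing in for neural library functions.

def map_l(e):
    # map_list<tau> e : apply e to every element of the list.
    return lambda xs: [e(a) for a in xs]

def fold_l(e, z):
    # fold_list<tau> e z : left fold, (e (e ... (e (e z a1) a2) ...) ak).
    def run(xs):
        acc = z
        for a in xs:
            acc = e(acc, a)
        return acc
    return run

def conv_l(e, p=1):
    # conv_list<tau> e : the kernel e sees a window of radius p around each
    # position; out-of-range indices are clamped to the ends of the list
    # (a_j = a_1 if j < 1, a_j = a_k if j > k).
    def run(xs):
        k = len(xs)
        return [e([xs[min(max(j, 0), k - 1)] for j in range(i - p, i + p + 1)])
                for i in range(k)]
    return run
```

For instance, fold_l mirrors a recurrent net consuming a list, with e playing the role of the recurrent cell and z the initial hidden state.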
Going further,\nthe language can be easily extended to handle bidirectional recurrent\nnetworks, attention mechanisms, and so on.\n\nFigure 2: A grid of images from the GTSRB dataset [37]. The least-cost path from the top left to the bottom right node is marked.\n\nExample: Shortest path in a grid of images. To show how HOUDINI can model tasks that mix perception and procedural reasoning,\nwe use an example that generalizes the navigation task of Gaunt et al. [14]. Suppose we are given a\ngrid of images (e.g., Figure 2), whose elements represent speed limits and are connected horizontally\nand vertically, but not diagonally. Passing through each node induces a penalty, which depends on\nthe node\u2019s speed limit, with lower speed limits having a higher penalty. The task is to predict the\nminimum cost d(u) incurred while traveling from a fixed starting point init to every other node u.\nOne way to compute these costs is using the Bellman-Ford shortest-path algorithm [7], whose i-th\niteration computes an estimated minimum cost di(u) of travel to each node u in the graph. The\ncost estimates for the (i + 1)-th iteration are computed using a relaxation operation: di+1(u) :=\nmin(di(u), min_{v \u2208 Adj(u)}(di(v) + w(v))), where w(v) is the penalty and Adj(u) the neighbors of u.\nAs the update to di(u) only depends on values at u and its neighbors, the relaxation step can be\nrepresented as a graph convolution. As described in Section 4, HOUDINI is able to discover an\napproximation of this program purely from data. The synthesized program uses a graph convolution,\na graph map, a neural module that processes the images of speed limits, and a neural module that\napproximates the relaxation function.\n\n3 Learning Algorithm\n\nNow we define our learning problem. For a HOUDINI program ew parameterized by a vector w,\nlet e[w \u21a6 v] be the function for the specific parameter vector v, i.e. 
by substituting w by v in e.\n\nSuppose we have a library L of neural functions and a training set D. As usual, we assume that D\nconsists of i.i.d. samples from a distribution pdata. We assume that D is properly typed, i.e., every\ntraining instance (xi, yi) \u2208 D has the same type, which is known. This means that we also know the\ntype \u03c4 of our target function. The goal in our learning problem is to discover a program e\u2217w of type\n\u03c4, and values v for w, such that e\u2217w[w \u21a6 v] = argmin_{e \u2208 Progs(L), w \u2208 Rn} E_{x \u223c pdata}[l(e, D, x)], where\nProgs(L) is the universe of all programs over L, and l is a suitable loss function.\nOur algorithm for this task consists of a symbolic program synthesis module called GENERATE and a\ngradient-based optimization module called TUNE. GENERATE repeatedly generates parameterized\nprograms ew and \u201cproposes\u201d them to TUNE. TUNE uses stochastic gradient descent to find parameter\nvalues v for ew that lead to the optimal value of the loss function on a training set, and produces a\nprogram e = ew[w \u21a6 v] with instantiated parameters. The final output of the algorithm is a program\ne\u2217, among all programs e as above, that leads to optimal loss on a validation set.\nAs each program proposed by GENERATE is subjected to training, GENERATE can only afford to\npropose a small number of programs, out of the vast combinatorial space of all programs. Selecting\nthese programs is a difficult challenge. We use and compare two strategies for this task. Now we\nsketch these strategies; for more details, see the supplementary material.\n\u2022 The first strategy is top-down iterative refinement, similar to the algorithm in the \u03bb2 program\nsynthesizer [13]. 
Here, the synthesis procedure iteratively generates a series of \u201cpartial\u201d programs\n(i.e., programs with missing code) over the library L, starting with an \u201cempty\u201d program and ending\nwith a complete program. A type inference procedure is used to rule out any partial program that is\nnot type-safe. A cost heuristic is used to generate programs in an order of structural simplicity.\nConcretely, shorter programs are evaluated first.\n\u2022 The second method is an evolutionary algorithm inspired by work on functional genetic programming [9]. Here, we use selection, crossover, and mutation operators to evolve a population of\nprograms over L. Types play a key role: all programs in the population are ensured to be type-safe,\nand mutation and crossover only replace a subterm in a program with terms of the same type.\n\nIn both cases, the use of types vastly reduces the amount of search that is needed, as the number of\ntype-safe programs of a given size is a small fraction of the number of programs of that size. See\nSection 4 for an experimental confirmation.\nLifelong Learning. A lifelong learning setting is a sequence of related tasks D1, D2, . . ., where\neach task Di has its own training and validation set. Here, the learner is called repeatedly, once for\neach task Di using a neural library Li, returning a best-performing program e\u2217i with parameters v\u2217i.\nWe implement transfer learning simply by adding new modules to the neural library after each call\nto the learner. We add all neural functions from e\u2217i back into the library, freezing their parameters.\nMore formally, let \u2295i1 . . . \u2295iK be the neural library functions which are called anywhere in e\u2217i. Each\nlibrary function \u2295ik has parameters wik, set to the value v\u2217ik by TUNE. The library for the next task\nis then Li+1 = Li \u222a {\u2295ik[wik \u21a6 v\u2217ik]}. This process ensures that the parameter vectors of \u2295ik are\nfrozen and can no longer be updated by subsequent tasks. Thus, we prevent catastrophic forgetting\nby design. Importantly, it is always possible for the synthesizer to introduce \u201cfresh networks\u201d whose\nparameters have not been pretrained. This is because the library always monotonically increases over\ntime, so that an original neural library function with untrained parameters is still available.\nThis approach has the important implication that the set of neural library functions that the synthesizer\nuses is not fixed, but continually evolving. Because both trained and untrained versions of the library\nfunctions are available, this can be seen to permit selective transfer, meaning that depending on which\nversion of the library function GENERATE chooses, the learner has the option of using or not using\npreviously learned knowledge in a new task. This fact allows HOUDINI to avoid negative transfer.\n\n4 Evaluation\n\nOur evaluation studies four questions. First, we ask whether HOUDINI can learn nontrivial differentiable programs that combine perception and algorithmic reasoning. Second, we study if HOUDINI\ncan transfer perceptual and algorithmic knowledge during lifelong learning. We study three forms of\ntransfer: low-level transfer of perceptual concepts across domains, high-level transfer of algorithmic\nconcepts, and selective transfer where the learning method decides on which known concepts to\nreuse. Third, we study the value of our type-directed approach to synthesis. Fourth, we compare the\nperformance of the top-down and evolutionary synthesis algorithms.\n\nFigure 3: Tasks and task sequences.\nIndividual tasks:\n\u2022 recognize digit(d): binary classification of whether an image contains a digit d \u2208 {0 . . . 9}\n\u2022 classify digit: classify a digit image into digit categories (0\u20139)\n\u2022 recognize toy(t): binary classification of whether an image contains a toy t \u2208 {0 . . . 4}\n\u2022 regress speed: return the speed value and a maximum distance constant from an image of a speed limit sign\n\u2022 regress mnist: return the value and a maximum distance constant from a digit image from the MNIST dataset\n\u2022 count digit(d): given a list of images, count the number of images of digit d\n\u2022 count toy(t): given a list of images, count the number of images of toy t\n\u2022 sum digits: given a list of digit images, compute the sum of the digits\n\u2022 shortest path street: given a grid of images of speed limit signs, find the shortest distances to all other nodes\n\u2022 shortest path mnist: given a grid of MNIST images, and a source node, find the shortest distances to all other nodes in the grid\nTask sequences:\n\u2022 Counting, CS1 (evaluates low-level transfer): Task 1: recognize digit(d1); Task 2: recognize digit(d2); Task 3: count digit(d1); Task 4: count digit(d2)\n\u2022 Counting, CS2 (evaluates high-level transfer, and learning of perceptual tasks from higher-level supervision): Task 1: recognize digit(d1); Task 2: count digit(d1); Task 3: count digit(d2); Task 4: recognize digit(d2)\n\u2022 Counting, CS3 (evaluates high-level transfer of counting across different image domains): Task 1: recognize digit(d); Task 2: count digit(d); Task 3: count toy(t); Task 4: recognize toy(t)\n\u2022 Summing, SS (demonstrates low-level transfer of a multi-class classifier as well as the advantage of functional methods like foldl in specific situations): Task 1: classify digit; Task 2: sum digits\n\u2022 Single-source shortest path, GS1 (learning of complex algorithms): Task 1: regress speed; Task 2: shortest path street\n\u2022 Single-source shortest path, GS2 (high-level transfer of complex algorithms): Task 1: regress mnist; Task 2: shortest path mnist; Task 3: shortest path street\n\u2022 Long sequence, LS: Task 1: count digit(d1); Task 2: count toy(t1); Task 3: recognize toy(t2); Task 4: recognize digit(d2); Task 5: count toy(t3); Task 6: count digit(d3); Task 7: count toy(t4); Task 8: recognize digit(d4); Task 9: count digit(d5)\n\nTask Sequences. Each lifelong learning setting is a sequence of individual learning tasks. The full\nlist of tasks is shown in Figure 3. These tasks include object recognition tasks over three data sets:\nMNIST [21], NORB [22], and the GTSRB data set of images of traffic signs [37]. In addition, we\nhave three algorithmic tasks: counting the number of instances of images of a certain class in a list of\nimages; summing a list of images of digits; and the shortest path computation described in Section 2.\nWe combine these tasks into seven sequences. Three of these (CS1, SS, GS1) involve low-level\ntransfer, in which earlier tasks are perceptual tasks like recognizing digits, while later tasks introduce\nhigher-level algorithmic problems. Three other task sequences (CS2, CS3, GS2) involve higher-level\ntransfer, in which earlier tasks introduce a high-level concept, while later tasks require a learner\nto re-use this concept on different perceptual inputs. For example, in CS2, once count digit(d1) is\nlearned for counting digits of class d1, the synthesizer can learn to reuse this counting network on a\nnew digit class d2, even if the learning system has never seen d2 before. The graph task sequence GS1\nalso demonstrates that the graph convolution combinator in HOUDINI allows learning of complex\ngraph algorithms, and GS2 tests if high-level transfer can be performed with this more complex task.\nFinally, we include a task sequence LS that is designed to evaluate our method on a task sequence\nthat is both longer and that lacks a favourable curriculum. The sequence LS was initially randomly\ngenerated, and then slightly amended in order to evaluate all lifelong learning concepts discussed.\nExperimental setup. 
We allow three kinds of neural library modules: multi-layer perceptrons\n(MLPs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We use two\nsymbolic synthesis strategies: top-down refinement and evolutionary. We use three types of baselines:\n(1) standalone networks, which do not do transfer learning, but simply train a new network (an RNN)\nfor each task in the sequence, starting from random weights; (2) a traditional neural approach to\nlow-level transfer (LLT) that transfers all weights learned in the previous task, except for the output\nlayer that is kept task-specific; and (3) a version of the progressive neural networks (PNNs) [32]\napproach, which retains a pool of pretrained models during training and learns lateral connections\namong these models.\n\nFigure 4: Top 3 synthesized programs on Graph Sequence 2 (GS2), with RMSEs. map g denotes a graph map (of the appropriate type); conv gi denotes i repeated applications of a graph convolution combinator.\nTask 1: regress mnist:\n1. nn gs2 1 \u25e6 nn gs2 2 (RMSE 1.47)\nTask 2: shortest path mnist:\n1. (conv g10 (nn gs2 3)) \u25e6 (map g (lib.nn gs2 1 \u25e6 lib.nn gs2 2)) (RMSE 1.57)\n2. (conv g9 (nn gs2 4)) \u25e6 (map g (lib.nn gs2 1 \u25e6 lib.nn gs2 2)) (RMSE 1.72)\n3. (conv g9 (nn gs2 5)) \u25e6 (map g (nn gs2 6 \u25e6 nn gs2 7)) (RMSE 4.99)\nTask 3: shortest path street:\n1. (conv g10 (lib.nn gs2 3)) \u25e6 (map g (nn gs2 8 \u25e6 nn gs2 9)) (RMSE 3.48)\n2. (conv g9 (lib.nn gs2 3)) \u25e6 (map g (nn gs2 10 \u25e6 nn gs2 11)) (RMSE 3.84)\n3. (conv g10 (lib.nn gs2 3)) \u25e6 (map g (lib.nn gs2 1 \u25e6 lib.nn gs2 2)) (RMSE 6.91)\n\n
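The shortest-path programs in Figure 4 all share one shape: a graph convolution applied repeatedly to node values produced by a map. A minimal sketch of that computation, with an exact min-based relaxation kernel standing in for the learned relaxation module (lib.nn gs2 3 in Figure 4); the dictionary-based graph encoding and the function names are our own assumptions.

```python
# Sketch of Bellman-Ford relaxation as a repeated graph convolution.
# `penalty` maps node -> per-node cost w(v); `adj` maps node -> neighbors.
# An exact min kernel replaces the learned neural relaxation module.

INF = float("inf")

def relax_step(d, penalty, adj):
    # d_{i+1}(u) = min(d_i(u), min over v in Adj(u) of d_i(v) + w(v))
    return {u: min(d[u], min((d[v] + penalty[v] for v in adj[u]), default=INF))
            for u in d}

def shortest_costs(penalty, adj, init):
    # Start with cost 0 at the source and infinity elsewhere, then apply
    # the relaxation convolution |V| times, as in Bellman-Ford.
    d = {u: (0.0 if u == init else INF) for u in penalty}
    for _ in range(len(penalty)):
        d = relax_step(d, penalty, adj)
    return d
```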
Experiments were performed using a single-threaded implementation on a\nLinux system, with 8-core Intel E5-2620 v4 2.10GHz CPUs and TITAN X (Pascal) GPUs.\nThe architecture chosen for the standalone and LLT baselines composes an MLP, an RNN, and a\nCNN, and matches the structure of a high-performing program returned by HOUDINI, to enable\nan apples-to-apples comparison. In PNNs, every task in a sequence is associated with a network\nwith the above architecture; lateral connections between these networks are learned. Each sequence\ninvolving digit classes d and toy classes t was instantiated five times for random values of d and t,\nand the results shown are averaged over these instantiations. In the graph sequences, we ran the same\nsequences with different random seeds, and shared the regressors learned for the first tasks across the\ncompeting methods for a more reliable comparison. We do not compare against PNNs in this case, as\nit is nontrivial to extend them to work with graphs. We evaluate the competing approaches on 2%,\n10%, 20%, 50% and 100% of the training data for all but the graph sequences, where we evaluate\nonly on 100%. For classification tasks, we report error, while for the regression tasks \u2014 counting,\nsumming, regress speed and shortest path \u2014 we report root mean-squared error (RMSE).\nResults: Synthesized programs. HOUDINI successfully synthesizes programs for each of the tasks\nin Figure 3 within at most 22 minutes. We list in Figure 4 the top 3 programs for each task in the graph\nsequence GS2, and the corresponding RMSEs. Here, function names with prefix \u201cnn \u201d denote fresh\nneural modules trained during the corresponding tasks. Terms with prefix \u201clib.\u201d denote pretrained\nneural modules selected from the library. 
The synthesis times for Task 1, Task 2, and Task 3 are 0.35s, 1061s, and 1337s, respectively.
As an illustration, consider the top program for Task 3: (conv_g^10 lib.nn_gs2_3) ◦ (map_g (nn_gs2_8 ◦ nn_gs2_9)). Here, map_g takes as argument a function for processing the images of speed limits. Applied to the input graph, the map returns a graph G in which each node contains a number associated with its corresponding image and information about the least cost of travel to the node. The kernel for the graph convolution combinator conv_g is a function lib.nn_gs2_3, originally learned in Task 2, that implements the relaxation operation used in shortest-path algorithms. The convolution is applied repeatedly, just as in the Bellman-Ford shortest-path algorithm.
In the SS sequence, the top program for Task 2 is: (fold_l nn_ss_3 zeros(1)) ◦ (map_l (nn_ss_4 ◦ lib.nn_ss_2)). Here, fold_l denotes the fold operator applied to lists, and zeros(dim) is a function that returns a zero tensor of the appropriate dimension. The program uses a map to apply, to all images in the input list, a previously learned CNN feature extractor (lib.nn_ss_2) followed by a learned transformation of those features into a 2D hidden state. It then uses fold with another function (nn_ss_3) to compute the final sum. Our results, presented in the supplementary material, show that this program greatly outperforms the baselines, even in the setting where all of the training data is available. We believe that this is because the synthesizer has selected a program with fewer parameters than the baseline RNN. In the results for the counting sequences (CS) and the long sequence (LS), the number of evaluated programs is restricted to 20; therefore fold_l is not used within the synthesized programs. This allows us to evaluate the advantage that HOUDINI derives from its transfer capabilities, rather than from the richness of its language.
Results: Transfer. 
First we evaluate the performance of the methods on the counting sequences (Figure 5). For space, we omit early tasks where, by design, there is no opportunity for transfer; for these results, see the Appendix. In all cases where there is an opportunity to transfer from previous tasks, we see that HOUDINI has much lower error than any of the other transfer learning methods. The actual programs generated by HOUDINI are listed in the Appendix.

Figure 5: Lifelong "learning to count" (sequences CS1–CS3), demonstrating both low-level transfer of perceptual concepts and high-level transfer of a counting network. HOUDINI-TD and HOUDINI-EVOL are HOUDINI with the top-down and evolutionary synthesizers, respectively. Columns show low-level transfer (CS1), high-level transfer (CS2), and high-level transfer across domains (CS3). Panels: (a) CS1 Task 3: count_digit(d1); (b) CS2 Task 3: count_digit(d2); (c) CS3 Task 3: count_toy(t1); (d) CS1 Task 4: count_digit(d2); (e) CS2 Task 4: recognize_digit(d2); (f) CS3 Task 4: recognize_toy(t1).

Task sequence CS1 evaluates the methods' ability to selectively perform low-level transfer of a perceptual concept across higher-level tasks. The first task that provides a transfer opportunity is CS1 task 3 (Figure 5a). There are two potential lower-level tasks that the methods could transfer from: recognize_digit(d1) and recognize_digit(d2). HOUDINI learns programs composed of neural modules nn_cs1_1, nn_cs1_2, nn_cs1_3, and nn_cs1_4 for these two tasks. During training for the count_digit(d1) task, all the previously learned neural modules are available in the library. The learner, however, picks the correct module (nn_cs1_2) for reuse, learning the program "nn_cs1_7 ◦ (map_l (nn_cs1_8 ◦ lib.nn_cs1_2))", where nn_cs1_7 and nn_cs1_8 are fresh neural modules, and map_l stands for a list map combinator of the appropriate type. 
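To make the shape of such programs concrete, here is a minimal plain-Python sketch (not HOUDINI's actual implementation) of the composition and map combinators and of a counting program of the form above, with hypothetical stand-in functions in place of the trained neural modules:

```python
def compose(f, g):
    """Function composition combinator: (f ◦ g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def map_l(f):
    """List map combinator: apply f to every element of a list."""
    return lambda xs: [f(x) for x in xs]

# Hypothetical stand-ins for the neural modules: lib_nn_cs1_2 plays the
# role of the reused digit recognizer, nn_cs1_8 adapts its output, and
# nn_cs1_7 aggregates the per-image scores into a count.
lib_nn_cs1_2 = lambda img: 1.0 if img == "d1" else 0.0   # reused "is it d1?" scorer
nn_cs1_8 = lambda score: score                            # fresh adapter module
nn_cs1_7 = lambda scores: sum(scores)                     # fresh aggregator module

# count_digit(d1) ≈ nn_cs1_7 ◦ (map_l (nn_cs1_8 ◦ lib.nn_cs1_2))
count_d1 = compose(nn_cs1_7, map_l(compose(nn_cs1_8, lib_nn_cs1_2)))
print(count_d1(["d1", "d2", "d1", "d3"]))  # → 2.0
```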
The low-level transfer baseline cannot select which of the previous tasks to reuse, and so suffers worse performance.
Task sequence CS2 provides an opportunity to transfer the higher-level concept of counting across different digit classification tasks. Here CS2 task 3 (Figure 5b) is the task that provides the first opportunity for transfer. We see that HOUDINI is able to learn much faster on this task because it is able to reuse a network which has learned from the previous counting task. Task sequence CS3 examines whether the methods can demonstrate high-level transfer when the input image domains are very different: from the MNIST domain to the NORB domain of toy images. We see in Figure 5c that the higher-level network still successfully transfers across tasks, learning an effective network for counting the number of toys of type t1, even though the network has not previously seen any toy images at all. What is more, because of the high-level transfer, HOUDINI has learned a modular solution to this problem. From the subsequent performance on a standalone toy classification task (Figure 5f), we see that CS3 task 3 has already caused the network to induce a reusable classifier on toys. Overall, HOUDINI outperforms all the baselines even under the limited-data setting, confirming the successful selective transfer of both low-level and high-level perceptual information. Similar results can be seen on the summing task (see supplementary material). Moreover, on the longer task sequence LS, we also find that HOUDINI performs significantly better on the tasks in the sequence where there is an opportunity for transfer, and performs comparably to the baselines on the other tasks (see supplementary material). 
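The modularity observed in CS3 can be illustrated with another hedged sketch (hypothetical stand-ins, not the actual learned modules): high-level transfer keeps the counting function fixed and swaps in a fresh perception function for the new domain.

```python
def compose(f, g):
    """Function composition combinator: (f ◦ g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def map_l(f):
    """List map combinator: apply f to every element of a list."""
    return lambda xs: [f(x) for x in xs]

# Reused high-level module: counts positive detections in a list of scores.
lib_count = lambda scores: sum(scores)

# Domain-specific perception modules (hypothetical stand-ins): one learned
# on MNIST-style digits, one learned freshly for NORB-style toys.
digit_d1 = lambda img: 1 if img == "digit:d1" else 0
toy_t1 = lambda img: 1 if img == "toy:t1" else 0

# The same counting "head" composes with either perception module:
count_digits = compose(lib_count, map_l(digit_d1))
count_toys = compose(lib_count, map_l(toy_t1))

print(count_digits(["digit:d1", "digit:d2", "digit:d1"]))  # → 2
print(count_toys(["toy:t1", "toy:t2"]))  # → 1
```

This is the sense in which reusing the high-level counter forces the fresh low-level module to become a standalone classifier for the new domain.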
Furthermore, on the summing sequence, our results also show low-level transfer.

Table 1: Effect of the type system on the number of programs considered in the symbolic search for task sequence CS1.

                     Number of programs
           size = 4    size = 5    size = 6
No types
  Task 1   8182        110372      1318972
  Task 2   12333       179049      2278113
  Task 3   17834       278318      3727358
  Task 4   24182       422619      6474938
+ Types
  Task 1   2           20          44
  Task 2   5           37          67
  Task 3   9           47          158
  Task 4   9           51          175

Table 2: Lifelong learning on graphs. Column 1: RMSE on regressing speed/distance from an image. Columns 2 and 3: RMSE on shortest path (mnist, street).

(a) Low-level transfer (llt), task sequence GS1:

                     Task 1   Task 2
RNN w llt            0.75     5.58
standalone           0.75     4.96
HOUDINI              0.75     1.77
HOUDINI EA           0.75     8.32
low-level-transfer   0.75     1.98

(b) High-level transfer, task sequence GS2:

                     Task 1   Task 2   Task 3
RNN w llt            1.44     5.00     6.05
standalone           1.44     6.49     7.
HOUDINI              1.44     1.50     3.31
HOUDINI EA           1.44     6.67     7.88
low-level-transfer   1.44     1.76     2.08

Finally, for the graph-based tasks (Table 2), we see that the graph convolutional program learned by HOUDINI on the graph tasks has significantly less error than a simple sequence model, a standalone baseline, and the evolutionary-algorithm-based version of HOUDINI. As explained earlier, in the shortest_path_street task in the graph sequence GS2, HOUDINI learns a program that uses newly learned regress functions for the street signs, along with a "relaxation" function already learned from the earlier task shortest_path_mnist. In Table 2, we see that this program performs well, suggesting that a domain-general relaxation operation is being learned. Our approach also outperforms the low-level-transfer baseline, except on the shortest_path_street task in GS2. We are unable to compare directly to NTPT because no public implementation is available. 
However, our graph task is a more\ndif\ufb01cult version of a task from [14], who report on their shortest-path task \u201c2% of random restarts\nsuccessfully converge to a program that generalizes\u201d (see their supplementary material).\nResults: Typed vs. untyped synthesis. To assess the impact of our type system, we count the\nprograms that GENERATE produces with and without a type system (we pick the top-down imple-\nmentation for this test, but the results also apply to the evolutionary synthesizer). Let the size of a\nprogram be the number of occurrences of library functions and combinators in the program. Table 1\nshows the number of programs of different sizes generated for the tasks in the sequence CS1. Since\nthe typical program size in our sequences is less than 6, we vary the target program size from 4 to 6.\nWhen the type system is disabled, the only constraint that GENERATE has while composing programs\nis the arity of the library functions. We note that this constraint fails to bring down the number of\ncandidate programs to a manageable size. With the type system, however, GENERATE produces far\nfewer candidate programs. For reference, neural architecture search often considers thousands of\npotential architectures for a single task [24].\nResults: Top-Down vs. Evolutionary Synthesis. Overall, the top-down implementation of GEN-\nERATE outperformed the evolutionary implementation. In some tasks, the two strategies performed\nsimilarly. However, the evolutionary strategy has high variance; indeed, in many runs of the task\nsequences, it times out without \ufb01nding a solution. The timed out runs are not included in the plots.\n\n5 Conclusion\n\nWe have presented HOUDINI, the \ufb01rst neurosymbolic approach to the synthesis of differentiable\nfunctional programs. Deep networks can be naturally speci\ufb01ed as differentiable programs, and\nfunctional programs can compactly represent popular deep architectures [26]. 
Therefore, symbolic search through a space of differentiable functional programs is particularly appealing, because it can at the same time select both which pretrained neural library functions should be reused, and also what deep architecture should be used to combine them. On several lifelong learning tasks that combine perceptual and algorithmic reasoning, we showed that HOUDINI can accelerate learning by transferring high-level concepts.

Acknowledgements. This work was partially supported by DARPA MUSE award #FA8750-14-2-0270 and NSF award #CCF-1704883.

References

[1] Houdini code repository. https://github.com/capergroup/houdini.

[2] Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, and Charles Sutton. Learning continuous semantic representations of symbolic expressions. In International Conference on Machine Learning (ICML), 2017.

[3] Rajeev Alur, Rastislav Bodík, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In Formal Methods in Computer-Aided Design (FMCAD), pages 1–17, 2013.

[4] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.

[5] John Backus. Can programming be liberated from the von Neumann style?: A functional style and its algebra of programs. Commun. ACM, 21(8):613–641, August 1978.

[6] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. In International Conference on Learning Representations (ICLR), 2017. arXiv:1611.01989.

[7] Richard Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.

[8] Matko Bosnjak, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel. 
Programming with a differentiable Forth interpreter. In International Conference on Machine Learning (ICML), pages 547–556, 2017.

[9] Forrest Briggs and Melissa O'Neill. Functional genetic programming with combinators. In Proceedings of the Third Asian-Pacific Workshop on Genetic Programming (ASPGP), pages 110–127, 2006.

[10] Rudy R. Bunel, Alban Desmaison, Pawan Kumar Mudigonda, Pushmeet Kohli, and Philip H. S. Torr. Adaptive neural compilation. In Advances in Neural Information Processing Systems 29, pages 1444–1452, 2016.

[11] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning (ICML), 2017.

[12] Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Joshua B. Tenenbaum. Learning to infer graphics programs from hand-drawn images. CoRR, abs/1707.09627, 2017.

[13] John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-output examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 229–239, 2015.

[14] Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, and Daniel Tarlow. Differentiable programs with neural libraries. In International Conference on Machine Learning (ICML), pages 1213–1222, 2017.

[15] Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. CoRR, abs/1608.04428, 2016.

[16] Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1–119, 2017.

[17] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. 
In International Conference on Computer Vision (ICCV), 2017. CoRR, abs/1704.05526.

[18] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani. Neural-guided deductive search for real-time program synthesis from examples. In International Conference on Learning Representations (ICLR), 2018.

[19] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[20] Vu Le and Sumit Gulwani. FlashExtract: A framework for data extraction by examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 542–553, 2014.

[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[22] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition (CVPR), 2004.

[23] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.

[24] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In European Conference on Computer Vision (ECCV), 2018. arXiv:1712.00559.

[25] Dougal Maclaurin, David Duvenaud, Matthew Johnson, and Ryan P. Adams. Autograd: Reverse-mode differentiation of native Python, 2015.

[26] Christopher Olah. Neural networks, types, and functional programming, 2015. http://colah.github.io/posts/2015-09-NN-Types-FP/.

[27] Peter-Michael Osera and Steve Zdancewic. Type-and-example-directed program synthesis. In PLDI, volume 50, pages 619–630. 
ACM, 2015.

[28] Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. In International Conference on Learning Representations (ICLR), 2016.

[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[30] Benjamin C. Pierce. Types and Programming Languages. MIT Press, 2002.

[31] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In International Conference on Machine Learning (ICML), 2017.

[32] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[33] Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning (ICML), 2018.

[34] Asim Shankar and Wolff Dobson. Eager execution: An imperative, define-by-run interface to TensorFlow. https://research.googleblog.com/2017/10/eager-execution-imperative-define-by.html, 2017.

[35] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.

[36] Armando Solar-Lezama. Program sketching. STTT, 15(5-6):475–495, 2013.

[37] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, (0), 2012.

[38] Sebastian Thrun and Tom M. Mitchell. 
Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25–46, 1995.

[39] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: A next-generation open source framework for deep learning. In Proceedings of the NIPS Workshop on Machine Learning Systems (LearningSys), 2015.

[40] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (NIPS), 2014.

[41] Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with Q-learning. CoRR, abs/1708.05552, 2017.

[42] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.