{"title": "Compositional generalization through meta sequence-to-sequence learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9791, "page_last": 9801, "abstract": "People can learn a new concept and use it compositionally, understanding how to \"blicket twice\" after learning how to \"blicket.\" In contrast, powerful sequence-to-sequence (seq2seq) neural networks fail such tests of compositionality, especially when composing new concepts together with existing concepts. In this paper, I show how memory-augmented neural networks can be trained to generalize compositionally through meta seq2seq learning. In this approach, models train on a series of seq2seq problems to acquire the compositional skills needed to solve new seq2seq problems. Meta se2seq learning solves several of the SCAN tests for compositional learning and can learn to apply implicit rules to variables.", "full_text": "Compositional generalization through meta\n\nsequence-to-sequence learning\n\nBrenden M. Lake\nNew York University\nFacebook AI Reasearch\n\nbrenden@nyu.edu\n\nAbstract\n\nPeople can learn a new concept and use it compositionally, understanding how to\n\u201cblicket twice\u201d after learning how to \u201cblicket.\u201d In contrast, powerful sequence-to-\nsequence (seq2seq) neural networks fail such tests of compositionality, especially\nwhen composing new concepts together with existing concepts. In this paper,\nI show how memory-augmented neural networks can be trained to generalize\ncompositionally through meta seq2seq learning. In this approach, models train on\na series of seq2seq problems to acquire the compositional skills needed to solve\nnew seq2seq problems. Meta se2seq learning solves several of the SCAN tests for\ncompositional learning and can learn to apply implicit rules to variables.\n\n1\n\nIntroduction\n\nPeople can learn new words and use them immediately in a rich variety of ways, thanks to their skills\nin compositional learning. Once a person learns the meaning of the verb \u201cto Facebook\u201d, she or he\ncan understand how to \u201cFacebook slowly,\u201d \u201cFacebook eagerly,\u201d or \u201cFacebook while walking.\u201d These\nabilities are due to systematic compositionality, or the algebraic capacity to understand and produce\nnovel utterances by combining familiar primitives [5, 27]. The \u201cFacebook slowly\u201d example depends\non knowledge of English, yet the ability is more general; people can also generalize compositionally\nwhen learning to follow instructions in arti\ufb01cial languages [18]. Despite its importance, systematic\ncompositionality has unclear origins, including the extent to which it is inbuilt versus learned. A\nkey challenge for cognitive science and arti\ufb01cial intelligence is to understand the computational\nunderpinnings of human compositional learning and to build machines with similar capabilities.\nNeural networks have long been criticized for lacking compositionality, leading critics to argue they\nare inappropriate for modeling language and thought [8, 24, 25]. Nonetheless neural architectures\nhave advanced and made important contributions in natural language processing (NLP) [20]. Recent\nwork has revisited these classic critiques through studies of modern neural architectures [10, 16, 3, 21,\n23, 2, 6], with a focus on the sequence-to-sequence (seq2seq) models used successfully in machine\ntranslation and other NLP tasks [34, 4, 38]. 
These studies show that powerful seq2seq approaches still have substantial difficulties with compositional generalization, especially when combining a new concept (“to Facebook”) with previous concepts (“slowly” or “eagerly”) [16, 3, 21].

New benchmarks have been proposed to encourage progress [10, 16, 2], including the SCAN dataset for compositional learning [16]. SCAN involves learning to follow instructions such as “walk twice and look right” by performing a sequence of appropriate output actions; in this case, the correct response is “WALK WALK RTURN LOOK.” A range of SCAN examples is shown in Table 1. Seq2seq models are trained on thousands of instructions built compositionally from primitives (“look”, “walk”, “run”, “jump”, etc.), modifiers (“twice”, “around right,” etc.) and conjunctions (“and” and “after”). After training, the aim is to execute, zero-shot, novel instructions such as “walk around right after look twice.” Previous studies show that seq2seq recurrent neural networks (RNN) generalize well when the training and test sets are similar, but fail catastrophically when generalization requires systematic compositionality [16, 3, 21]. For instance, models often fail to understand how to “jump twice” after learning how to “run twice,” “walk twice,” and how to “jump.” Building neural architectures with these compositional abilities remains an open problem.

In this paper, I show how memory-augmented neural networks can be trained to generalize compositionally through “meta sequence-to-sequence learning” (meta seq2seq learning). As is standard with meta learning, training is distributed across a series of small datasets called “episodes” instead of a single static dataset [36, 32, 7], in a process called “meta-training.” Specific to meta seq2seq learning, each episode is a novel seq2seq problem that provides “support” sequence pairs (input and output) and “query” sequences (input only), as shown in Figures 1 and 2. The network loads the support sequence pairs into an external memory [33, 11, 31] to provide needed context for producing the right output sequence for each query sequence. The network’s output sequences are compared to the targets, demonstrating how to generalize compositionally from the support items to the query items.

Meta seq2seq networks meta-train on multiple seq2seq problems that require compositional generalization, with the aim of acquiring the compositional skills needed to solve new problems. New seq2seq problems are solved entirely using the activation dynamics and external memory of the networks; no weight updates are made after the meta-training phase ceases. Through its unique choice of architecture and training procedure, the network can implicitly learn rules that operate on variables, an ability considered beyond the reach of eliminative connectionist networks [24, 25, 23] but which has been pursued by more structured alternatives [33, 11, 12, 28].
In the sections below,\nI show how meta seq2seq learning can solve several challenging SCAN tasks for compositional\nlearning, although generalizing to longer output sequences remains unsolved.\n\n2 Related work\n\nMeta sequence-to-sequence learning builds on several areas of active research. Meta learning has been\nsuccessfully applied to few-shot image classi\ufb01cation [36, 32, 7, 19] including sequential versions that\nrequire external memory [31]. Few-shot visual tasks are qualitatively different from the compositional\nreasoning tasks studied here, which demand different architectures and learning principles. Closer to\nthe present work, meta learning has been recently applied to low resource machine translation [13],\ndemonstrating one application of meta learning to seq2seq translation problems. Crucially, these\nnetworks tackle a new task through weight updates rather than through memory and reasoning [7],\nand it is unclear whether this approach would work for compositional reasoning.\nExternal memories have also expanded the capabilities of modern neural network architectures.\nMemory networks have been applied to reasoning and question answering tasks [33], in cases where\nonly a single output is needed instead of a series of outputs. The Differentiable Neural Computer\n(DNC) [11] is also related to my proposal, in that a single architecture can reason through a wide\nrange of scenarios, including seq2seq-like graph traversal tasks. The DNC is a complex architecture\nwith multiple heads for reading and writing to memory, temporal links between memory cells, and\ntrackers to monitor memory usage. In contrast, the meta seq2seq learner uses a simple memory\nmechanism akin to memory networks [33] and does not call the memory module with every new\ninput symbol. Meta seq2seq uses higher-level abstractions to store and reason with entire sequences.\nThere has been recent progress on SCAN due to clever data augmentation [1] and syntax-based\nattention [30], although both approaches are currently limited in scope. For instance, syntactic\nattention relies on a symbol-to-symbol mapping module that may be inappropriate for many domains.\nMeta seq2seq is compared against syntactic attention [30] in the experiments that follow.\n\n3 Model\n\nThe meta sequence-to-sequence approach learns how to learn sequence-to-sequence (seq2seq)\nproblems \u2013 it uses a series of training seq2seq problems to develop the needed compositional skills\nfor solving new seq2seq problems. 
An overview of the meta seq2seq learner is illustrated in Figure 1. In this figure, the network is processing a query instruction “jump twice” in the context of a support set that shows how to “run twice,” “walk twice”, “look twice,” and “jump.” In broad strokes, the architecture is a standard seq2seq model [22] translating a query input into a query output (Figure 1). A recurrent neural network (RNN) encoder (f_ie; red RNN in bottom right of Figure 1) and a RNN decoder (f_od; green RNN in top right of Figure 1) work together to interpret the query sequence as an output sequence, with the encoder passing an embedding at each timestep (Q) to a Luong attention decoder [22]. The architecture differs from standard seq2seq modeling through its use of the support set, external memory, and training procedure. As the messages pass from the query encoder to the query decoder, they are infused with stepwise context C provided by an external memory that stores the support items. The inner-workings of the architecture are described in detail below.

Figure 1: The meta sequence-to-sequence learner. The backbone is a sequence-to-sequence (seq2seq) network augmented with a context C produced by an external memory. The seq2seq model uses an RNN encoder (f_ie; bottom right) to read a query and then pass stepwise messages Q to an attention-based RNN decoder (f_od; top right). Distinctive to meta seq2seq learning, the messages Q are transformed into C based on context from the support set (left). The transformation operates through a key-value memory. Support item inputs are encoded and used as keys K while outputs are encoded and used as values V. The query is stepwise compared to the keys, retrieving weighted sums M of the most similar values. This is mapped to C which is decoded as the final output sequence. Color coding indicates shared RNN modules.
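To make the data flow in Figure 1 concrete, the following is a minimal PyTorch sketch of the forward computation. It is illustrative only: it substitutes a single-layer unidirectional LSTM for the two-layer biLSTM encoders described below and a linear readout for the full attention-based decoder, and all module and variable names are assumptions rather than the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaSeq2SeqSketch(nn.Module):
    """Simplified wiring of Figure 1; not the released architecture."""
    def __init__(self, n_in_symbols, n_out_symbols, m=200):
        super().__init__()
        self.m = m
        self.in_embed = nn.Embedding(n_in_symbols, m)
        self.out_embed = nn.Embedding(n_out_symbols, m)
        self.input_encoder = nn.LSTM(m, m, batch_first=True)   # stands in for f_ie
        self.output_encoder = nn.LSTM(m, m, batch_first=True)  # stands in for f_oe
        self.combine = nn.Linear(2 * m, m)                      # W_c1 in the text
        self.readout = nn.Linear(m, n_out_symbols)              # stand-in for the decoder

    def encode_input(self, seq):
        states, (h_n, _) = self.input_encoder(self.in_embed(seq))
        return states, h_n[-1]  # stepwise states and final state

    def forward(self, support_in, support_out, query_in):
        # Keys: final input-encoder state of each support input sequence
        K = torch.stack([self.encode_input(s.unsqueeze(0))[1].squeeze(0) for s in support_in])
        # Values: final output-encoder state of each support output sequence
        V = torch.stack([self.output_encoder(self.out_embed(s.unsqueeze(0)))[1][0][-1].squeeze(0)
                         for s in support_out])
        # Query: stepwise hidden embeddings h_1..h_T of the query instruction
        H = self.encode_input(query_in.unsqueeze(0))[0].squeeze(0)        # (T, m)
        # Soft key-value lookup: M = softmax(QK^T / sqrt(m)) V
        A = F.softmax(H @ K.T / self.m ** 0.5, dim=-1)                    # (T, n_support)
        M = A @ V                                                         # (T, m)
        C = torch.tanh(self.combine(torch.cat([H, M], dim=-1)))          # stepwise context
        # The real model decodes C with an attention LSTM; a linear readout is used here
        return self.readout(C)

# Smoke test with random token ids (shapes only; no training)
model = MetaSeq2SeqSketch(n_in_symbols=10, n_out_symbols=8, m=16)
support_in = [torch.tensor([1, 2]), torch.tensor([3])]
support_out = [torch.tensor([4, 4]), torch.tensor([5])]
print(model(support_in, support_out, torch.tensor([1, 3, 2])).shape)  # torch.Size([3, 8])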
Input encoder. The input encoder f_ie(·,·) (Figure 1 red) encodes the query input instruction (e.g., “jump twice”) and each of the input instructions for the n_s support items (“run twice”, “walk twice”, “jump”, etc.). The encoder first embeds the sequence of symbols (e.g., words) to get a sequence of input embeddings w_t ∈ R^m, which the RNN transforms into hidden embeddings h_t ∈ R^m,

h_t = f_ie(h_{t-1}, w_t).    (1)

For the query sequence, the embedding h_t at each step t = 1, . . . , T passes both through the external memory as well as directly to the decoder. For each support sequence, only the last step hidden embedding is needed, denoted K_i ∈ R^m for i = 1, . . . , n_s. These vectors K_i become the keys in the external key-value memory (Figure 1). Although other choices are possible, this paper uses bidirectional long short-term memory encoders (biLSTM) [14].

Output encoder. The output encoder f_oe(·,·) (Figure 1 blue) is used for each of the n_s support items and their output sequences (e.g., “RUN RUN”, “WALK WALK”, “JUMP”, etc.). First, the encoder embeds the sequence of output symbols (e.g., actions) using an embedding layer. Second, a single embedding for the entire sequence is computed using the same process as f_ie(·,·) (Equation 1). Only the final RNN state is captured for each support item i and stored as the value vector V_i ∈ R^m for i = 1, . . . , n_s in the key-value memory. A biLSTM encoder is also used.

External memory. The architecture uses a soft key-value memory similar to memory networks [33]. The precise formulation used is described in [35]. The key-value memory uses the attention function

Attention(Q, K, V) = softmax(QK^T / √m) V = AV,    (2)

with matrices Q, K, and V for the queries, keys, and values respectively, and the matrix A as the attention weights, A = softmax(QK^T / √m). Each query instruction spawns T embeddings from the RNN encoder, one for each query symbol, which populate the rows of the query matrix Q ∈ R^{T×m}. The encoded support items form the rows of K ∈ R^{n_s×m} and the rows of V ∈ R^{n_s×m} for their input and output sequences, respectively. Attention weights A ∈ R^{T×n_s} indicate which memory cells are active for each query step. The output of the memory is a matrix M = AV where each row is a weighted combination of the value vectors, indicating the memory output for each of the T query input steps, M ∈ R^{T×m}. Finally, a stepwise context is computed by combining the query input embeddings h_t and the stepwise memory outputs M_t ∈ R^m with a concatenation layer C_t = tanh(W_c1 [h_t; M_t]), producing a stepwise context matrix C ∈ R^{T×m}.

For additional representational power, the key-value memory could replace the simple attention module with a multi-head attention module, or even a transformer-style multi-layer multi-head attention module [35]. This additional power was not needed for the tasks tackled in this paper, but it is compatible with the meta seq2seq approach.
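As one illustration of the optional upgrade mentioned above, the single softmax(QK^T/√m)V lookup could be swapped for PyTorch's built-in multi-head attention while keeping the same stepwise-context interface; the shapes, head count, and names below are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

m, T, n_support = 200, 5, 4
H = torch.randn(1, T, m)           # stepwise query embeddings h_1..h_T (batch of 1)
K = torch.randn(1, n_support, m)   # support input encodings (keys)
V = torch.randn(1, n_support, m)   # support output encodings (values)

# Multi-head replacement for the simple soft key-value lookup
mha = nn.MultiheadAttention(embed_dim=m, num_heads=8, batch_first=True)
M, attn_weights = mha(H, K, V)     # M: (1, T, m), one memory readout per query step

combine = nn.Linear(2 * m, m)      # W_c1, as in the simple version
C = torch.tanh(combine(torch.cat([H, M], dim=-1)))
print(C.shape)                     # torch.Size([1, 5, 200])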
Output decoder. The output decoder translates the stepwise context C into an output sequence (Figure 1 green). The decoder embeds the previous output symbol as a vector o_{j-1} ∈ R^m, which is fed to the RNN (LSTM) along with the previous hidden state g_{j-1} ∈ R^m to get the next hidden state,

g_j = f_od(g_{j-1}, o_{j-1}).    (3)

The initial hidden state g_0 is set as the context from the last step, C_T ∈ R^m. Luong-style attention [22] is used to compute a decoder context u_j ∈ R^m such that u_j = Attention(g_j, C, C). This context is passed through another concatenation layer g̃_j = tanh(W_c2 [g_j; u_j]), which is then mapped to a softmax output layer to produce an output symbol. This process repeats until all of the output symbols are produced and the RNN terminates the response by producing an end-of-sequence symbol.

Meta-training. Meta-training optimizes the network across a series of training episodes, each of which is a novel seq2seq problem with n_s support items and n_q query items (see example in Figure 2). The model’s vocabulary is the union of the episode vocabularies, and the loss function is the negative log-likelihood of the predicted output sequences for the queries. My implementation uses each episode as a training batch and takes one gradient step per episode. For improved sample and training efficiency, the optimizer could take multiple steps per episode or replay past episodes, although this was not explored here.

During meta-training, the network may need extra encouragement to use its memory. To provide this, the support items are passed through the network as additional query items, i.e., using an auxiliary “support loss” that is added to the query loss computed from the query items. The support items have already been observed and stored in memory, and thus it is not noteworthy that the network learns to reconstruct these output sequences. Nevertheless, it amplifies the memory during meta-training.

4 Experiments

4.1 Architecture and training parameters

A PyTorch implementation is available (see acknowledgements). All experiments use the same hyperparameters, and many were set according to the best-performing seq2seq model in [16]. The input and output sequence encoders are two-layer biLSTMs with m = 200 hidden units per layer, producing m dimensional embeddings. The output decoder is a two-layer LSTM also with m = 200. Dropout is applied with probability 0.5 to each LSTM and symbol embedding. A greedy decoder is effective due to SCAN’s determinism [16].

Networks are meta-trained for 10,000 episodes with the ADAM optimizer [15]. The learning rate is reduced from 0.001 to 0.0001 halfway, and gradients with an l2-norm greater than 50 are clipped. With my PyTorch implementation, it takes less than 1 hour to train meta seq2seq on SCAN using one NVIDIA Titan X GPU (regular seq2seq trains in less than 30 minutes). All models were trained five times with different random initializations and random meta-training episodes.
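The following is a hedged sketch of the episode-based meta-training loop implied by Section 3 and the hyperparameters above (one Adam step per episode, gradient clipping at an l2-norm of 50, the learning rate dropped halfway, and the auxiliary support loss obtained by replaying support items as extra queries). The model and sample_episode interfaces are assumptions for illustration, matching the simplified sketch after Figure 1 rather than the released code.

import torch
import torch.nn.functional as F

def meta_train(model, sample_episode, n_episodes=10000, lr=0.001):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for episode in range(n_episodes):
        if episode == n_episodes // 2:                 # reduce the learning rate halfway
            for group in optimizer.param_groups:
                group["lr"] = 0.0001
        support, queries = sample_episode()            # lists of (input_ids, target_ids) pairs
        support_in = [s for s, _ in support]
        support_out = [t for _, t in support]
        optimizer.zero_grad()
        loss = 0.0
        # Query loss plus the auxiliary support loss (support items replayed as queries)
        for inp, target in queries + support:
            logits = model(support_in, support_out, inp)
            loss = loss + F.cross_entropy(logits, target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=50.0)
        optimizer.step()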
4.2 Experiment: Mutual exclusivity

This experiment evaluates meta seq2seq learning on a synthetic task borrowed from developmental psychology (Figure 2). Each episode introduces a new mapping from nonsense words (“dax”, “wif”, etc.) to nonsense meanings (“red circle”, “green circle”, etc.), partially revealed in the support set.

Figure 2: The mutual exclusivity task showing two meta-training episodes (left) and one test episode (right). Each episode requires executing instructions in a novel language of 4 input pseudowords (“dax”, “wif”, etc.) and four output actions (“red”, “yellow”, etc.). Each episode has a random mapping from pseudowords to meanings, providing three isolated words and their outputs as support. Answering queries requires concatenation as well as reasoning by mutual exclusivity to infer the fourth mapping (“dax” means “blue” in the test episode).

To answer the queries, a model must acquire two abilities inspired by human generalization patterns [18]: 1) using isolated symbol mappings to translate concatenated symbol sequences, and 2) using mutual exclusivity (ME) to resolve unseen mappings. Children use ME to help learn the meaning of new words, assuming that an object with one label does not need another [26]. When provided with a familiar object (e.g., a cup) and an unfamiliar object (e.g., a cherry pitter) and asked to “Show me the dax,” children tend to pick the unfamiliar object rather than the familiar one.

Adults also use ME to help resolve ambiguity. When presented with episodes like Figure 2 in a laboratory setting, participants use ME to resolve unseen mappings and translate sequences in a symbol-by-symbol manner. Most people generalize in this way spontaneously, without any instructions or feedback about how to respond to compositional queries [18]. An untrained meta seq2seq learner would not be expected to generalize spontaneously – human participants come to the task with a starting point that is richer in every way – but computational models should nonetheless be capable of these inferences if trained to make them. This is a challenge for neural networks because the mappings change every episode, and standard architectures do not reason using ME. In fact, standard networks map novel inputs to familiar outputs, which is the opposite of ME [9].

Experimental setup. During meta-training, each episode is generated by sampling a random mapping from four input symbols to four output symbols (19 permutations used for meta-training and 5 for testing). The support set shows how three symbols should be translated, while one is withheld. The queries consist of arbitrary concatenations of the pseudowords (length 2 to 6) which can be translated symbol-by-symbol to produce the proper output responses (20 queries per episode). The fourth input symbol, which was withheld from the support, is used in the queries. The model must learn how to use ME to map this unseen symbol to an unseen meaning rather than a seen meaning (Figure 2).
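A sketch of one way to generate mutual-exclusivity episodes matching the setup above; the color tokens stand in for the colored-circle outputs of Figure 2, and the names are illustrative assumptions (the paper additionally splits the 24 possible permutations into 19 for meta-training and 5 for testing).

import random

WORDS = ["dax", "wif", "lug", "zup"]
COLORS = ["RED", "GREEN", "YELLOW", "BLUE"]   # stand-ins for the colored-circle meanings

def sample_me_episode(n_queries=20):
    # New random word-to-meaning mapping for every episode
    meanings = dict(zip(WORDS, random.sample(COLORS, len(COLORS))))
    held_out = random.choice(WORDS)            # the fourth mapping is withheld from support
    support = [(w, [meanings[w]]) for w in WORDS if w != held_out]
    queries = []
    for _ in range(n_queries):
        seq = random.choices(WORDS, k=random.randint(2, 6))   # may include the held-out word
        queries.append((" ".join(seq), [meanings[w] for w in seq]))
    return support, queries

support, queries = sample_me_episode()
print(support)      # three isolated word -> meaning demonstrations
print(queries[0])   # a concatenated query and its symbol-by-symbol translation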
Results. Meta seq2seq successfully learns to reason with ME to answer queries, achieving 100% accuracy (SD = 0%). Based on the isolated mappings stored in memory, the network learns to translate sequences of those items. Moreover, it can acquire and use new mappings at test time, utilizing only its external memory and the activation dynamics. By learning to use ME, the network shows it can reason about the absence of symbols in the memory rather than simply their presence. The attention weights and use of memory are visualized and presented in the appendix (Figure A.1).

4.3 Experiment: Adding a new primitive through permutation meta-training

This experiment evaluates meta seq2seq learning on the SCAN task of adding a new primitive [16]. Models are trained to generalize compositionally by decomposing the original SCAN task into a series of related seq2seq sub-tasks. The goal is to learn a new primitive instruction and use it compositionally, operationalized in SCAN as the “add jump” split [16]. Models learn a new primitive “jump” and aim to use it in combination with other instructions, resembling the “to Facebook” example introduced earlier in this paper. First, the original seq2seq problem from [16] is described. Second, the adapted problem for training meta seq2seq learners is described.

Table 1: SCAN task for compositional learning with input instructions (left) and their output actions (right) [16].

jump ⇒ JUMP
jump left ⇒ LTURN JUMP
jump around right ⇒ RTURN JUMP RTURN JUMP RTURN JUMP RTURN JUMP
turn left twice ⇒ LTURN LTURN
jump thrice ⇒ JUMP JUMP JUMP
jump opposite left and walk thrice ⇒ LTURN LTURN JUMP WALK WALK WALK
jump opposite left after walk around left ⇒ LTURN WALK LTURN WALK LTURN WALK LTURN WALK LTURN LTURN JUMP

Seq2seq learning. Standard seq2seq models applied to SCAN have both a training and a test phase. During training, seq2seq models are exposed to the “jump” instruction in a single context demonstrating how to jump in isolation. Also during training, the models are exposed to all primitive and composed instructions for the other actions (e.g., “walk”, “walk twice”, “look around right and walk twice”, etc.) along with the correct output sequences, which is about 13,000 unique instructions. Following [16], the critical “jump” demonstration is overrepresented in training to ensure it is learned. During test, models are evaluated on all of the composed instructions that use the “jump” primitive, examining the ability to integrate new primitives and use them productively. For instance, models are evaluated on instructions such as “jump twice”, “jump around right and walk twice”, “walk left thrice and jump right thrice,” along with about 7,000 other instructions using jump.

Meta seq2seq learning. Meta seq2seq models applied to SCAN have both a meta-training and a test phase. During meta-training, the models observe episodes that are variants of the original seq2seq problem, each of which requires rapid learning of new meanings for the primitives. Specifically, each meta-training episode provides a different random assignment of the primitive instructions (‘jump’, ‘run’, ‘walk’, ‘look’) to their meanings (‘JUMP’, ‘RUN’, ‘WALK’, ‘LOOK’), with the restriction that the proper (original) permutation not be observed during meta-training. Withholding the original permutation, there are 23 possible permutations for meta-training. Each episode presents 20 support and 20 query instructions, with instructions sampled from the full SCAN set. The models predict the response to the query instructions, using the support instructions and their outputs as context. Through meta-training, the models are familiarized with all of the possible SCAN training and test instructions, but no episode maps all of its instructions to their original (target) output sequences. In fact, models have no signal to learn which primitives in general correspond to which actions, since the assignments are sampled anew for each episode.
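The episode construction described above can be sketched by relabelling canonical SCAN pairs under a sampled non-original assignment of primitives to actions; CANONICAL_PAIRS is a tiny illustrative stand-in for the full SCAN instruction set, and the helper names are assumptions rather than the released code.

import random

PRIMITIVES = ["jump", "run", "walk", "look"]
ACTIONS = ["JUMP", "RUN", "WALK", "LOOK"]
CANONICAL = dict(zip(PRIMITIVES, ACTIONS))

# Tiny stand-in for the full set of SCAN (instruction, action sequence) pairs
CANONICAL_PAIRS = [
    ("jump twice", "JUMP JUMP"),
    ("walk left", "LTURN WALK"),
    ("look twice and run", "LOOK LOOK RUN"),
]

def sample_assignment():
    # Random primitive-to-action permutation, withholding the original one
    while True:
        shuffled = random.sample(ACTIONS, len(ACTIONS))
        if shuffled != ACTIONS:
            return dict(zip(PRIMITIVES, shuffled))

def relabel(pairs, assignment):
    # Map each primitive's canonical action to its newly assigned action;
    # LTURN/RTURN and other tokens pass through unchanged.
    action_map = {CANONICAL[w]: assignment[w] for w in PRIMITIVES}
    return [(instr, " ".join(action_map.get(tok, tok) for tok in out.split()))
            for instr, out in pairs]

episode_pairs = relabel(CANONICAL_PAIRS, sample_assignment())
print(episode_pairs)   # e.g. "jump twice" might now map to "RUN RUN"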
During test, models are evaluated on rapid learning of new meanings. Just four support items are observed and loaded into memory, consisting of the isolated primitives (‘jump’, ‘run’, ‘walk’, ‘look’) paired with their original meanings (‘JUMP’, ‘RUN’, ‘WALK’, ‘LOOK’). Notably, memory use at test time (with only four primitive items in memory) diverges substantially from memory use during meta-training (with 20 complex instructions in memory). To evaluate test accuracy, models make predictions on the original SCAN test instructions consisting of all composed instructions using “jump.” An output sequence is considered correct only if it perfectly matches the target sequence.

Alternative models. The meta seq2seq learner is compared with an analogous “standard seq2seq” learner [22], which uses the same architecture with the external memory removed. The standard seq2seq learner is trained on the original SCAN problem with a fixed meaning for each primitive. Each meta seq2seq “episode” can be interpreted as a standard seq2seq “batch,” and a batch size of 40 is chosen to equate the total number of presentations between approaches. All other architectural and training parameters are shared between meta seq2seq learning and seq2seq learning.

The meta seq2seq learner is also compared with two additional lesioned variants that examine the importance of different architectural components. First, the meta seq2seq learner is trained “without support loss” (Section 3 meta-training), which guides the architecture about how to best use its memory. Second, the meta seq2seq learner is trained “without decoder attention” (Section 3 output decoder). This leads to substantial differences in the architecture operation; rather than producing a sequence of context embeddings C_1, . . . , C_T for each step of the T steps of a query sequence, only the last step context C_T is computed and passed to the decoder.

Results. The results are summarized in Table 2. On the “add jump” test set [16], standard seq2seq modeling completely fails to generalize compositionally, reaching an average performance of only 0.03% correct (SD = 0.02). It fails even while achieving near perfect performance on the training set (>99% on average).

Table 2: Test accuracy on the SCAN “add jump” task across different training paradigms.

Model | standard training | permutation meta-training | augmentation meta-training
meta seq2seq learning | — | 99.95% | 98.71%
-without support loss | — | 5.43% | 99.48%
-without decoder attention | — | 10.32% | 9.29%
standard seq2seq | 0.03% | — | 12.26%
syntactic attention [30] | 78.4% | — | —
This replicates the results from [16], which trained many seq2seq models, finding that the best network performed at only 1.2% accuracy. Again, standard seq2seq models do not show the necessary systematic compositionality.

The meta seq2seq model succeeds at learning compositional skills, achieving an average performance of 99.95% correct (SD = 0.08). At test, the support set contains only the four primitives and their mappings, demonstrating that meta seq2seq learning can handle test episodes that are qualitatively different from those seen during training. Moreover, the network learns how to store and retrieve variables from memory with arbitrary assignments, as long as the network is familiarized with the possible input and output symbols during meta-training (but not necessarily how they correspond). A visualization of how meta seq2seq uses attention on SCAN is shown in the appendix (Figure A.2). The meta seq2seq learner also outperforms syntactic attention, which achieves 78.4% and varies widely in performance across runs (SD = 27.4) [30].

The lesion analyses demonstrate the importance of various components. The meta seq2seq learner fails to solve the task without the guidance of the support loss, achieving only 5.43% correct (SD = 7.6). These runs typically learn the consistent, static meanings such as “twice”, “thrice”, “around right” and “after”, but fail to learn the dynamic primitives which require using memory. The meta seq2seq learner also fails when the decoder attention is removed (10.32% correct; SD = 6.4), suggesting that a single m dimensional embedding is not sufficient to relate a query to the support items.

4.4 Experiment: Adding a new primitive through augmentation meta-training

Experiment 4.3 demonstrates that the meta seq2seq approach can learn how to learn the meaning of a primitive and use it compositionally. However, only a small set of four input primitives and four meanings was considered; it is unclear whether meta seq2seq learning works in more complex compositional domains. In this experiment, meta seq2seq is evaluated on a much larger domain produced by augmenting the meta-training with 20 additional input and action primitives. This more challenging task requires that the networks handle a much larger set of possible meanings. The architecture and training procedures are identical to those used in Experiment 4.3 except where noted.

Seq2seq learning. To equate the learning environment across approaches, standard seq2seq models use a training phase that is substantially expanded from that in Experiment 4.3. During training, the input primitives include the original four (‘jump’, ‘run’, ‘walk’, ‘look’) as well as 20 new symbols (‘Primitive1,’ . . . , ‘Primitive20’). The output meanings include the original four (‘JUMP’, ‘RUN’, ‘WALK’, ‘LOOK’) as well as 20 new actions (‘Action1,’ . . . , ‘Action20’). In the seq2seq training (but notably, not in meta seq2seq training), ‘Primitive1’ always corresponds to ‘Action1,’ ‘Primitive2’ corresponds to ‘Action2,’ and so on.
A training batch uses the original SCAN templates with primitives sampled from the augmented set rather than the original set; for instance, a training instruction may be “look around right and Primitive20 twice.” During training the “jump” primitive is only presented in isolation, and it is included in every batch to ensure the network learns it properly. Compared to Experiment 4.3, the augmented SCAN domain provides substantially more evidence for compositionality and productivity.

Meta seq2seq learning. Meta seq2seq models are trained similarly to Experiment 4.3 with an augmented primitive set. During meta-training, episodes are generated by randomly sampling a set of four primitive instructions (from the set of 24) and their corresponding meanings (from the set of 24). For instance, an example training episode could use the four instruction primitives ‘Primitive16’, ‘run’, ‘Primitive2’, and ‘Primitive12’ mapped respectively to actions ‘Action3’, ‘Action20’, ‘JUMP’, and ‘Action11’. Although Experiment 4.3 has only 23 possible assignments, this experiment has orders-of-magnitude more possible assignments than training episodes, ensuring meta-training only provides a small subset. Moreover, the models are evaluated using a stricter criterion for generalization: the primitive “jump” is never assigned to the proper action “JUMP” during meta-training.

The test phase is analogous to the previous experiment. Models are evaluated by loading all of the isolated primitives (‘jump’, ‘run’, ‘walk’, ‘look’) paired with their original meanings (‘JUMP’, ‘RUN’, ‘WALK’, ‘LOOK’) into memory as support items. No other items are included in memory. To evaluate test accuracy, models make predictions on the original SCAN test instructions consisting of all composed instructions using “jump.”

Results. The results are summarized in Table 2. The meta seq2seq learner succeeds at acquiring “jump” and using it correctly, achieving 98.71% correct (SD = 1.49) on the test instructions. The slight decline in performance compared to Experiment 4.3 is not statistically significant with five runs. The standard seq2seq learner takes advantage of the augmented training to generalize better than when using standard SCAN training (Experiment 4.3 and [16]), achieving 12.26% accuracy (SD = 8.33) on the test instructions (with >99% accuracy during training). The augmented task provides 23 fully compositional primitives during training, compared to the three in the original task. The basic seq2seq model still fails to properly discover and utilize this salient compositionality.

The lesion analyses show that the support loss is not critical in this setting, and the meta seq2seq learner achieves 99.48% correct without it (SD = 0.37). In contrast to Experiment 4.3, using many primitives more strongly guides the network to use the memory, since the network cannot substantially reduce the training loss without it. The decoder attention remains critical in this setting, and the network attains merely 9.29% correct without it (SD = 13.07).
Only the full meta seq2seq learner masters both the current and the previous learning settings (Table 2).

4.5 Experiment: Combining familiar concepts through meta-training

The next experiment examines combining familiar concepts in new ways.

Seq2seq learning. Seq2seq training holds out all instances of “around right” for testing, while training on all other SCAN instructions (“around right” split [21]). Using the symmetry between “left” and “right,” the network must extrapolate to “jump around right” from training examples like “jump around left,” “jump left,” and “jump right.”

Meta seq2seq learning. Meta-training follows Experiment 4.4. Instead of just two directions “left” and “right”, the possibilities also include “Direction1” and “Direction2” (or equivalently, “forward” and “backward”). Meta-training episodes are generated by randomly sampling two directions to be used in the instructions (from “left”, “right”, “forward”, “backward”) and their meanings (from “LTURN,” “RTURN,” “FORWARD”, “BACKWARD”), permuted to have no systematic correspondence. The primitive “right” is never assigned to the proper meaning during meta-training. Meta-training uses 20 support and 20 query instructions. During test, models must infer how to perform an action “around right” and use it compositionally in all possible ways, with a support set of just “turn left” and “turn right” mapped to their proper meanings.

Results. Meta seq2seq learning is nearly perfect at inferring the meaning of “around right” from its components (99.96% correct; SD = 0.08; Table 3), while standard seq2seq fails catastrophically (0.0% correct) and syntactic attention struggles (28.9%; SD = 34.8) [30].

4.6 Experiment: Generalizing to longer instructions through meta-training

The final experiment examines whether the meta seq2seq approach can learn to generalize to longer sequences, even when the test sequences are longer than any experienced during meta-training.

Seq2seq learning. The SCAN instructions are divided into training and test sets based on the number of required output actions. Following the SCAN “length” split [16], standard seq2seq models are trained on all instructions that require 22 or fewer actions (≈17,000) and evaluated on all instructions that require longer action sequences (≈4,000, ranging in length from 24 to 28). During test, the network must execute instructions that require 24 actions such as “jump around right twice and look opposite right thrice,” where both sub-instructions have been trained but the conjunction is novel.

Meta seq2seq learning. Meta-training optimizes the network to extrapolate from shorter support instructions to longer query instructions. During test, the model is examined on even longer queries than those seen during meta-training (drawn from the SCAN “length” test set). For meta-training, the original “length” training set is sub-divided into the support pool (all instructions with less than 12 output actions) and a query pool (all instructions with 12 to 22 output actions). In each episode, the network gets 100 support items and must respond to 20 (longer) query items. To encourage use of the external memory, primitive augmentation as in Experiment 4.4 is also applied. During test, the models load 100 support items from the original “length” split training set (lengths 1 to 22 output actions) and respond to queries from the original test set (lengths 24-28).
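A brief sketch of how such length-split episodes can be drawn from the two pools described above; scan_length_train is an assumed list of (instruction, action_sequence) pairs from the SCAN “length” training set, and the function name is illustrative rather than taken from the released code.

import random

def sample_length_episode(scan_length_train, n_support=100, n_query=20):
    # Support pool: short demonstrations (fewer than 12 output actions);
    # query pool: longer instructions (12 to 22 output actions).
    support_pool = [(i, o) for i, o in scan_length_train if len(o.split()) < 12]
    query_pool = [(i, o) for i, o in scan_length_train if 12 <= len(o.split()) <= 22]
    return random.sample(support_pool, n_support), random.sample(query_pool, n_query)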
Results. None of the models perform well on longer sequences (Table 3). The meta seq2seq learner achieves 16.64% accuracy (SD = 2.10) while the baseline seq2seq learner achieves 7.71% (SD = 1.90). Syntactic attention also performs poorly at 15.2% (SD = 0.7) [30]. Despite its other compositional successes, meta seq2seq lacks the truly systematic generalization needed to extrapolate to longer sequences.

Table 3: Test accuracy on the SCAN “around right” and “length” tasks.

Model | around right | length
meta seq2seq learning | 99.96% | 16.64%
standard seq2seq | 0.0% | 7.71%
syntactic attention [30] | 28.9% | 15.2%

5 Discussion

People are skilled compositional learners while standard neural networks are not. After learning how to “dax,” people understand how to “dax twice,” “dax slowly,” or even “dax like there is no tomorrow.” These abilities are central to language and thought yet they are conspicuously lacking in modern neural networks [16, 3, 21, 23, 2].

In this paper, I introduced a meta sequence-to-sequence (meta seq2seq) approach for learning to generalize compositionally, exploiting the algebraic structure of a domain to help understand novel utterances. Unlike standard seq2seq, meta seq2seq learners can abstract away the surface patterns and operate closer to rule space. Rather than attempting to solve “jump around right twice and walk thrice” by comparing surface level patterns with training items, meta seq2seq learns to treat the instruction as a template “x around right twice and y thrice” where x and y are variables. This approach solves several SCAN compositional learning tasks that have eluded standard NLP approaches, although it still does not generalize systematically to longer sequences [16]. In this way, meta seq2seq learning is a step forward in capturing the compositional abilities studied in synthetic learning tasks [18] and motivated in the “to dax” or “to Facebook” thought experiments.

Meta seq2seq learning has implications for understanding how people generalize compositionally. Similarly to meta-training, people learn in dynamic environments, tackling a series of changing learning problems rather than iterating through a static dataset. There is natural pressure to generalize systematically after a single experience with a new verb like “to Facebook,” and thus people are incentivized to generalize compositionally in ways that resemble the meta seq2seq loss. Meta learning is a powerful new toolbox for studying learning-to-learn and other elusive cognitive abilities [17, 37], although more work is needed to understand its implications for cognitive science.

The models studied here can learn variables that assign novel meanings to words at test time, using only the network dynamics and the external memory. Although powerful, this is a limited concept of “variable” since it requires familiarity with all of the possible input and output assignments during meta-training.
This limitation is shared by nearly all existing neural architectures [33, 11, 31] and shows that the meta seq2seq framework falls short of addressing Marcus’s challenge of extrapolating outside the training space [24, 25, 23]. In future work, I intend to explore adding more symbolic machinery to the architecture [29] with the goal of handling genuinely new symbols. Hybrid neuro-symbolic models could also address the challenge of generalizing to longer output sequences, a problem that continues to vex neural networks [16, 3, 30] including meta seq2seq learning.

The meta seq2seq approach could be applied to a wide range of tasks including low resource machine translation [13], graph traversal [11], or “Flash Fill” style program induction [28]. For traditional seq2seq tasks like machine translation, standard seq2seq training could be augmented with hybrid training that alternates between standard training and meta-training to encourage compositional generalization. I am excited about the potential of the meta seq2seq approach both for solving practical problems and for illuminating the foundations of human compositional learning.

Acknowledgments

PyTorch code is available at https://github.com/brendenlake/meta_seq2seq. I am very grateful to Marco Baroni for contributing key ideas to the architecture and experiments. I also thank Kyunghyun Cho, Guy Davidson, Tammy Kwan, Tal Linzen, Gary Marcus, and Maxwell Nye for their helpful comments.

References

[1] Jacob Andreas. Good-Enough Compositional Data Augmentation. arXiv preprint, 2019.

[2] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic generalization: What is required and can it be learned? pages 1–16, 2018.

[3] Joost Bastings, Marco Baroni, Jason Weston, Kyunghyun Cho, and Douwe Kiela. Jump to better conclusions: SCAN both left and right. In Proceedings of the EMNLP BlackboxNLP Workshop, pages 47–55, Brussels, Belgium, 2018.

[4] Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany, 2016.

[5] Noam Chomsky. Syntactic Structures. Mouton, Berlin, Germany, 1957.

[6] Ishita Dasgupta, Demi Guo, Andreas Stuhlmuller, Samuel J Gershman, and Noah D Goodman. Evaluating Compositionality in Sentence Embeddings. arXiv preprint, 2018.

[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. International Conference on Machine Learning (ICML), 2017.

[8] Jerry Fodor and Zenon Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71, 1988.

[9] Kanishk Gandhi and Brenden M Lake. Mutual exclusivity as a challenge for neural networks. arXiv preprint, 2019.

[10] Samuel J Gershman and Joshua B Tenenbaum. Phrase similarity in humans and machines.
In Proceedings of the 37th Annual Conference of the Cognitive Science Society, 2015.

[11] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.

[12] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to Transduce with Unbounded Memory. In Advances in Neural Information Processing Systems, 2015.

[13] Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-Learning for Low-Resource Neural Machine Translation. In Empirical Methods in Natural Language Processing (EMNLP), 2018.

[14] S Hochreiter and J Schmidhuber. Long short-term memory. Neural computation, 9:1735–1780, 1997.

[15] Diederik P Kingma and Max Welling. Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets. In International Conference on Machine Learning (ICML 2014), 2014.

[16] Brenden M Lake and Marco Baroni. Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. In International Conference on Machine Learning (ICML), 2018.

[17] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:E253, 2017.

[18] Brenden M Lake, Tal Linzen, and Marco Baroni. Human few-shot learning of compositional instructions. In Proceedings of the 41st Annual Conference of the Cognitive Science Society, 2019.

[19] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. The Omniglot Challenge: A 3-Year Progress Report. Current Opinion in Behavioral Sciences, 29:97–104, 2019.

[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[21] João Loula, Marco Baroni, and Brenden M Lake. Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks. arXiv preprint, 2018. URL http://arxiv.org/abs/1807.07545.

[22] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In Empirical Methods in Natural Language Processing (EMNLP), 2015.

[23] Gary Marcus. Deep Learning: A Critical Appraisal. arXiv preprint, 2018.

[24] Gary F Marcus. Rethinking Eliminative Connectionism. Cognitive Psychology, 37:243–282, 1998.

[25] Gary F Marcus. The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press, Cambridge, MA, 2003.

[26] Ellen M Markman and Gwyn F Wachtel. Children’s Use of Mutual Exclusivity to Constrain the Meanings of Words. Cognitive Psychology, 20:121–157, 1988.

[27] Richard Montague. Universal Grammar. Theoria, 36:373–398, 1970.

[28] Maxwell Nye, Luke Hewitt, Joshua B. Tenenbaum, and Armando Solar-Lezama. Learning to Infer Program Sketches. International Conference on Machine Learning (ICML), 2019.

[29] Scott Reed and Nando de Freitas. Neural Programmer-Interpreters. In International Conference on Learning Representations (ICLR), 2016.

[30] Jake Russin, Jason Jo, Randall C. O’Reilly, and Yoshua Bengio.
Compositional generalization\nin a deep seq2seq model by separating syntax and semantics. arXiv preprint, 2019. URL\nhttp://arxiv.org/abs/1904.09708.\n\n[31] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap.\nMeta-Learning with Memory-Augmented Neural Networks. In International Conference on\nMachine Learning (ICML), 2016.\n\n[32] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning.\n\nIn Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[33] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-To-End Memory\n\nNetworks. In Advances in Neural Information Processing Systems 29, 2015.\n\n[34] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural\n\nNetworks. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,\nLukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information\nProcessing Systems., 2017.\n\n[36] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra.\nMatching Networks for One Shot Learning. In Advances in Neural Information Processing\nSystems 29 (NIPS), 2016.\n\n[37] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos,\nCharles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn.\narXiv, 2016. URL http://arxiv.org/abs/1611.05763.\n\n[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc Le, Mohammad Norouzi, Wolfgang Macherey,\nMaxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin\nJohnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto\nKazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith,\nJason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean.\nGoogle\u2019s neural machine translation system: Bridging the gap between human and machine\ntranslation. http://arxiv.org/abs/1609.08144, 2016.\n\n11\n\n\f", "award": [], "sourceid": 5200, "authors": [{"given_name": "Brenden", "family_name": "Lake", "institution": "New York University"}]}