{"title": "Ordered Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 5037, "page_last": 5048, "abstract": "Stack-augmented recurrent neural networks (RNNs) have been of interest to the deep learning community for some time. However, the difficulty of training memory models remains a problem obstructing the widespread use of such models. In this paper, we propose the Ordered Memory architecture. Inspired by Ordered Neurons (Shen et al., 2018), we introduce a new attention-based mechanism and use its cumulative probability to control the writing and erasing operation of the memory. We also introduce a new Gated Recursive Cell to compose lower-level representations into higher-level representation. We demonstrate that our model achieves strong performance on the logical inference task (Bowman et al., 2015) and the ListOps (Nangia and Bowman, 2018) task. We can also interpret the model to retrieve the induced tree structure, and find that these induced structures align with the ground truth. Finally, we evaluate our model on the Stanford Sentiment Treebank tasks (Socher et al., 2013), and find that it performs comparatively with the state-of-the-art methods in the literature.", "full_text": "Ordered Memory\n\nYikang Shen\u2217\n\nShawn Tan\u2217\n\nArian Hosseini\u2217\n\nMila/Universit\u00b4e de Montr\u00b4eal\n\nMila/Universit\u00b4e de Montr\u00b4eal\n\nMila/Universit\u00b4e de Montr\u00b4eal\n\nand Microsoft Research\n\nMontr\u00b4eal, Canada\n\nMontr\u00b4eal, Canada\n\nand Microsoft Research\n\nMontr\u00b4eal, Canada\n\nZhouhan Lin\n\nMila/Universit\u00b4e de Montr\u00b4eal\n\nMontr\u00b4eal, Canada\n\nAlessandro Sordoni\nMicrosoft Research\nMontr\u00b4eal, Canada\n\nAaron Courville\n\nMila/Universit\u00b4e de Montr\u00b4eal\n\nMontr\u00b4eal, Canada\n\nAbstract\n\nStack-augmented recurrent neural networks (RNNs) have been of interest to the\ndeep learning community for some time. 
However, the difficulty of training memory models remains a problem obstructing the widespread use of such models. In this paper, we propose the Ordered Memory architecture. Inspired by Ordered Neurons (Shen et al., 2018), we introduce a new attention-based mechanism and use its cumulative probability to control the writing and erasing operations of the memory. We also introduce a new Gated Recursive Cell to compose lower-level representations into higher-level representations. We demonstrate that our model achieves strong performance on the logical inference task (Bowman et al., 2015) and the ListOps task (Nangia and Bowman, 2018). We can also interpret the model to retrieve the induced tree structure, and find that these induced structures align with the ground truth. Finally, we evaluate our model on the Stanford Sentiment Treebank tasks (Socher et al., 2013), and find that it performs comparably with the state-of-the-art methods in the literature².

1 Introduction

A long-sought-after goal in natural language processing is to build models that account for the compositional nature of language, granting them an ability to understand complex, unseen expressions from the meaning of simpler, known expressions (Montague, 1970; Dowty, 2007). Despite being successful in language generation tasks, recurrent neural networks (RNNs, Elman (1990)) fail at tasks that explicitly require and test compositional behavior (Lake and Baroni, 2017; Loula et al., 2018). In particular, Bowman et al. (2015), and later Bahdanau et al. (2018), give evidence that, by exploiting the appropriate compositional structure of the task, models can generalize better to out-of-distribution test examples. Results from Andreas et al. (2016) also indicate that recursively composing smaller modules results in better representations. 
The remaining challenge, however, is learning the underlying structure and the rules governing composition from the observed data alone. This is often referred to as grammar induction (Chen, 1995; Cohen et al., 2011; Roark, 2001; Chelba and Jelinek, 2000; Williams et al., 2018).

Fodor and Pylyshyn (1988) claim that "cognitive capacities always exhibit certain symmetries, so that the ability to entertain a given thought implies the ability to entertain thoughts with semantically related contents," and use the term systematicity to describe this phenomenon. Exploiting known symmetries in the structure of the data has been a useful technique for achieving good generalization capabilities in deep learning, particularly in the form of convolutions (Fukushima, 1980), which leverage parameter-sharing. If we consider the architectures used in Socher et al. (2013) or Tai et al. (2015), the same recursive operation is performed at known points along the input where the substructures are meant to be composed. Could symmetries in the structure of natural language data be learned and exploited by models that operate on them?

In recent years, many attempts have been made in this direction using neural network architectures (Grefenstette et al., 2015; Bowman et al., 2016; Williams et al., 2018; Yogatama et al., 2018; Shen et al., 2018; Dyer et al., 2016). 

*yi-kang.shen@umontreal.ca, tanjings@mila.quebec, arian.hosseini9@gmail.com.
²The code can be found at https://github.com/yikangshen/Ordered-Memory

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
These models typically augment a recurrent neural network with a stack and a buffer, which operate in a similar way to how a shift-reduce parser builds a parse tree. While some assume that ground-truth trees are available for supervised learning (Bowman et al., 2016; Dyer et al., 2016), others use reinforcement learning (RL) techniques to learn the optimal sequence of shift-reduce actions in an unsupervised fashion (Yogatama et al., 2018).

To avoid some of the challenges of RL training (Havrylov et al., 2019), some approaches use a continuous stack (Grefenstette et al., 2015; Joulin and Mikolov, 2015; Yogatama et al., 2018). These can usually only perform one push or pop action per time step, requiring different mechanisms, akin to adaptive computation time (ACT, Graves (2016); Jernite et al. (2016)), to perform the right number of shift and reduce steps to express the correct parse. In addition, continuous stack models tend to "blur" the stack by performing a "soft" shift of either the pointer to the head of the stack (Grefenstette et al., 2015) or all the values in the stack (Joulin and Mikolov, 2015; Yogatama et al., 2018). Finally, while these previous models can learn to manipulate a stack, they lack the capability to look ahead to future tokens before performing the stack manipulation for the current time step.

In this paper, we propose a novel architecture, Ordered Memory (OM), which includes a new memory updating mechanism and a new Gated Recursive Cell. We demonstrate that our method generalizes on synthetic tasks where the ability to parse is crucial to solving them. On the logical inference dataset (Bowman et al., 2015), we show that our model can systematically generalize to unseen combinations of operators. 
In the ListOps dataset (Nangia and Bowman, 2018), we show that our model can learn to solve the task with an order of magnitude fewer training examples than the baselines. The parsing experiments show that our method can effectively recover the latent tree structure of both tasks with very high accuracy. We also perform experiments on the Stanford Sentiment Treebank, in both binary classification and fine-grained settings (SST-2 & SST-5), and find that we achieve results comparable to the current benchmarks.

2 Related Work

Composition with recursive structures has been shown to work well for certain types of tasks. Pollack (1990) first suggests their use with distributed representations. Later, Socher et al. (2013) show their effectiveness on sentiment analysis tasks. Recent work has demonstrated that recursive composition of sentences is crucial to systematic generalisation (Bowman et al., 2015; Bahdanau et al., 2018). Kuncoro et al. (2018) also demonstrate that architectures like that of Dyer et al. (2016) handle syntax-sensitive dependencies better for language-related tasks.

Schützenberger (1963) first showed an equivalence between push-down automata (stack-augmented automatons) and context-free grammars. Knuth (1965) introduced the notion of a shift-reduce parser that uses a stack for a subset of formal languages that can be parsed from left to right. This technique for parsing has been applied to natural language as well: Shieber (1983) applies it to English, using assumptions about how native English speakers parse sentences to remove ambiguous parse candidates. More recently, Maillard et al. (2017) show that a soft tree can emerge from all possible tree structures through backpropagation.

The idea of using neural networks to control a stack is not new. Zeng et al. (1994) use gradient estimates to learn to manipulate a stack with a neural network. Das et al. 
(1992) and Mozer and Das (1993) introduced the notion of a continuous stack that makes the model fully differentiable. Much of the recent work with stack-augmented networks builds upon the development of neural attention (Graves, 2013; Bahdanau et al., 2014; Weston et al., 2014). Graves et al. (2014) proposed methods for reading and writing using a head, along with a "soft" shift mechanism. Apart from using attention mechanisms, Grefenstette et al. (2015) proposed a neural stack where the push and pop operations are made to be differentiable, which worked well on synthetic datasets. Yogatama et al. (2016) propose RL-SPINN, where the discrete stack operations are directly learned by reinforcement learning.

Figure 1: An example run of the OM model. Let the input sequence a, b, c, d, e and its hierarchical structure be as shown in the figure. Ideally, the OM model will output the values shown in the above tables. The occupied slots in M_t are highlighted in gray. The yellow slots in \hat{M}_t are slots that can be attended on at time-step t + 1. At the first time-step (t = 1), the model initializes the candidate memory \hat{M}_1 with input a and the memory M_0 with zero vectors. At t = 2, the model attends on the last memory slot to compute M_1 (Eqn. 5), followed by \hat{M}_2 (Eqn. 7). At t = 3, given the input c, the model will attend on the last slot. Consequently, the memory slot for b is erased by \vec{\pi}_3. Given Eqns. 6 and 7, our model will recursively compute every slot in the candidate memory \hat{M}^i_t to include information from \hat{M}^{i-1}_t and M^i_{t-1}. Since the cell(·) function only takes 2 inputs, the actual computation graph is a binary tree.

3 Model

The OM model actively maintains a stack and processes the input from left to right, with a one-step lookahead in the sequence. 
This allows the OM model to decide the local structure more accurately,\nmuch like a shift-reduce parser (Knuth, 1965).\n\nAt a given point t in the input sequence x (the t-th time-step), we have a memory of candidate\nsub-trees spanning the non-overlapping sub-sequences in x1, . . . , xt\u22121, with each sub-tree being\nrepresented by one slot in the memory stack. We also maintain a memory stack of sub-trees that\ncontains x1, . . . , xt\u22122. We use the input xt to choose its parent node from our previous candidate\nsub-trees. The descendant sub-trees of this new sub-tree (if they exist) are removed from the memory\nstack, and this new sub-tree is then added. We then build the new candidate sub-trees that include xt\nusing the current input and the memory stack. In what follows, we describe the OM model in detail.\nTo facilitate a clearer description, a discrete attention scheme is assumed, but only \u201csoft\u201d attention\nis used in both the training and evaluation of this model.\n\nLet D be the dimension of each memory slot and N be the number of memory slots. 
At time-step t, the model takes four inputs:

• M_{t−1}: a memory matrix of dimension N × D, where each occupied slot is a distributed representation for sub-trees spanning the non-overlapping subsequences in x_1, ..., x_{t−2};
• \hat{M}_{t−1}: a matrix of dimension N × D that contains representations for candidate subtrees that include the leaf node x_{t−1};
• \vec{\pi}_{t−1}: a vector of dimension N, where each element indicates whether the respective slot in M_{t−1} is occupied by a subtree;
• x_t: a vector of dimension D_{in}, the input at time-step t.

The model first transforms x_t into a D-dimensional vector:

    \tilde{x}_t = LN(W x_t + b)    (1)

where LN(·) is the layer normalization function (Ba et al., 2016). To select the candidate representations from \hat{M}_{t−1}, the model uses \tilde{x}_t as its query to attend on \hat{M}_{t−1}:

    p_t = Att(\tilde{x}_t, \hat{M}_{t−1}, \vec{\pi}_{t−1})    (2)
    \vec{\pi}^i_t = \sum_{j \le i} p^j_t    (3)
    \overleftarrow{\pi}^i_t = \sum_{j \ge i} p^j_t    (4)

where Att(·) is a masked attention function, \vec{\pi}_{t−1} is the mask, p_t is a distribution over the memory slots in \hat{M}_{t−1}, and p^j_t is the probability on the j-th slot. The attention mechanism is described in Section 3.1. 
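As a concrete illustration (a minimal sketch, not the paper's implementation), the cumulative probabilities of Eqns. 3-4 are just forward and backward cumulative sums of the attention distribution p_t:

```python
# Minimal sketch of Eqns. 3-4: forward and backward cumulative sums of the
# attention distribution p_t over N memory slots (toy values, assumed here).
def cumulative_probs(p):
    n = len(p)
    forward = [sum(p[:i + 1]) for i in range(n)]   # ->pi^i_t = sum_{j<=i} p^j_t
    backward = [sum(p[i:]) for i in range(n)]      # <-pi^i_t = sum_{j>=i} p^j_t
    return forward, backward

# With a sharp (one-hot) attention on slot 2 of a 4-slot memory:
fwd, bwd = cumulative_probs([0.0, 0.0, 1.0, 0.0])
# fwd == [0.0, 0.0, 1.0, 1.0]: indicator of slots at or above the pointer
# bwd == [1.0, 1.0, 1.0, 0.0]: indicator of slots at or below the pointer
```

With a sharp p_t, the two cumulative vectors become the step-function indicators that the text above interprets as "where the stack exists" and "where its top is".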
Intuitively, p_t can be viewed as a pointer to the head of the stack, \vec{\pi}_t is an indicator of where the stack exists, and \overleftarrow{\pi}_t is an indicator of where the top of the stack is and where it is non-existent.

To compute M_t, we combine \hat{M}_{t−1} and M_{t−1} through:

    M^i_t = M^i_{t−1} · (1 − \overleftarrow{\pi}_t)^i + \hat{M}^i_{t−1} · \overleftarrow{\pi}^i_t,  ∀i    (5)

Suppose p_t points at a memory slot y_t in \hat{M}. Then \overleftarrow{\pi}_t will write \hat{M}^i_{t−1} to M^i_t for i ≤ y_t, and (1 − \overleftarrow{\pi}_t) will write M^i_{t−1} to M^i_t for i > y_t. In other words, Eqn. 5 copies everything from M_{t−1} to the current timestep, up to where p_t is pointing.

We believe that this is a crucial point that differentiates our model from past stack-augmented models like Yogatama et al. (2016) and Joulin and Mikolov (2015). Both constructions have the 0-th slot as the top of the stack, and perform a convex combination of each slot in the memory / stack given the action performed. More concretely, a distribution over the actions that is not sharp (e.g. 0.5 for pop) will result in a weighted sum of an un-popped stack and a popped stack, resulting in a blurred memory state. Compounded over time steps, this effect can make such models hard to train. 
In our case, because (1 − \overleftarrow{\pi}_t)^i is non-decreasing with i, its value accumulates to 1 at or before slot N. This results in a full copy, guaranteeing that the earlier states are retained. This full retention of earlier states may play a part in the training process, as it is a strategy also used in Gulcehre et al. (2017), where all the memory slots are filled before any erasing or writing takes place.

To compute candidate memories for time step t, we recurrently update all memory slots with

    o^i_t = cell(M^i_t, \hat{M}^{i−1}_t)    (6)
    \hat{M}^i_t = \tilde{x}_t · (1 − \vec{\pi}_t)^i + o^i_t · \vec{\pi}^i_t,  ∀i    (7)

where \hat{M}^0_t is \tilde{x}_t. The cell(·) function can be seen as a recursive composition function in a recursive neural network (Socher et al., 2013). We propose a new cell function in Section 3.2. The output of time step t is the last memory slot \hat{M}^N_t of the new candidate memory, which summarizes all the information from x_1, ..., x_t using the induced structure.

    Data: x_1, ..., x_T
    Result: o^N_T
    initialize M_0, \hat{M}_0;
    for t ← 1 to T do
        \tilde{x}_t = LN(W x_t + b);
        p_t = Att(\tilde{x}_t, \hat{M}_{t−1}, \vec{\pi}_{t−1});
        \vec{\pi}^i_t = \sum_{j \le i} p^j_t;
        \overleftarrow{\pi}^i_t = \sum_{j \ge i} p^j_t;
        \hat{M}^0_t = \tilde{x}_t;
        for i ← 1 to N do
            M^i_t = M^i_{t−1} · (1 − \overleftarrow{\pi}_t)^i + \hat{M}^i_{t−1} · \overleftarrow{\pi}^i_t;
            o^i_t = cell(M^i_t, \hat{M}^{i−1}_t);
            \hat{M}^i_t = \tilde{x}_t · (1 − \vec{\pi}_t)^i + o^i_t · \vec{\pi}^i_t;
        end
    end
    return o^N_T;

    Algorithm 1: Ordered Memory algorithm. The attention function Att(·) is defined in section 3.1. The recursive cell function cell(·) is defined in section 3.2.
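A single time step of the update above can be sketched in plain Python. This is a simplified, hypothetical rendering: slots hold scalars rather than D-dimensional vectors, the learned Att(·) network of Section 3.1 is replaced by a distribution p_t passed in by the caller, and cell(·) is a stand-in for the gated recursive cell of Section 3.2.

```python
def om_step(x_t, M_prev, M_hat_prev, p_t, cell):
    """One Ordered Memory time step (Eqns. 3-7), with scalar toy slots.

    p_t is assumed given; in the real model it comes from masked attention.
    """
    n = len(M_prev)
    fwd = [sum(p_t[:i + 1]) for i in range(n)]   # ->pi^i_t
    bwd = [sum(p_t[i:]) for i in range(n)]       # <-pi^i_t

    M_t, M_hat_t = [], []
    prev_hat = x_t                               # Mhat^0_t = x~_t
    for i in range(n):
        # Eqn. 5: candidate sub-trees written up to the pointer, old memory above.
        m_i = M_prev[i] * (1 - bwd[i]) + M_hat_prev[i] * bwd[i]
        # Eqn. 6: compose a new sub-tree candidate from the slot and Mhat^{i-1}_t.
        o_i = cell(m_i, prev_hat)
        # Eqn. 7: new candidates mix the raw input and the composed output.
        hat_i = x_t * (1 - fwd[i]) + o_i * fwd[i]
        M_t.append(m_i)
        M_hat_t.append(hat_i)
        prev_hat = hat_i
    return M_t, M_hat_t

# Toy run: averaging "cell", one-hot attention on slot 1 of a 3-slot memory.
M_t, M_hat_t = om_step(
    x_t=1.0,
    M_prev=[4.0, 5.0, 6.0],
    M_hat_prev=[7.0, 8.0, 9.0],
    p_t=[0.0, 1.0, 0.0],
    cell=lambda m, h: 0.5 * (m + h),
)
# M_t == [7.0, 8.0, 6.0]; M_hat_t == [1.0, 4.5, 5.25]
```

Note how the sharp p_t makes the update discrete: slots at or below the pointer take the previous candidates, the slot above keeps its old memory, and the candidates above the pointer are recursively composed.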
The pseudo-code for the OM algorithm is shown in Algorithm 1.

3.1 Masked Attention

Given the projected input \tilde{x}_t and candidate memory \hat{M}_{t−1}, we feed every (\tilde{x}_t, \hat{M}^i_{t−1}) pair into a feed-forward network:

    \alpha^i_t = \frac{w^{Att}_2}{\sqrt{N}} \tanh\left( W^{Att}_1 \begin{bmatrix} \hat{M}^i_{t-1} \\ \tilde{x}_t \end{bmatrix} + b_1 \right) + b_2    (8)

    \beta^i_t = \exp\left( \alpha^i_t - \max_j \alpha^j_t \right)    (9)

where W^{Att}_1 is an N × 2N matrix, w^{Att}_2 is an N-dimensional vector, and the output \beta^i_t is a scalar. The purpose of dividing by \sqrt{N} is to scale down the logits before the softmax is applied, a technique similar to the one seen in Vaswani et al. (2017). We further mask \beta_t with the cumulative probability from the previous time step to prevent the model from attending on non-existent parts of the stack:

    \hat{\beta}^i_t = \beta^i_t \, \vec{\pi}^{\,i+1}_{t-1}    (10)

where \vec{\pi}^{\,N+1}_{t-1} = 1 and \vec{\pi}^{\,\le N}_0 = 0. We can then compute the probability distribution:

    p^i_t = \hat{\beta}^i_t / \sum_j \hat{\beta}^j_t    (11)

This formulation bears similarity to the method used for the multi-pop mechanism seen in Yogatama et al. (2018).

3.2 Gated Recursive Cell

Instead of using the recursive cell proposed in TreeLSTM (Tai et al., 2015) and RNTN (Socher et al., 2010), we propose a new gated recursive cell, which is inspired by the feed-forward layer in the Transformer (Vaswani et al., 2017). 
The inputs M^i_t and \hat{M}^{i−1}_t are concatenated and fed into a fully connected feed-forward network:

    \begin{bmatrix} v^i_t \\ h^i_t \\ c^i_t \\ u^i_t \end{bmatrix} = W^{Cell}_2 \,\mathrm{ReLU}\left( W^{Cell}_1 \begin{bmatrix} \hat{M}^{i-1}_t \\ M^i_t \end{bmatrix} + b_1 \right) + b_2    (12)

Like the TreeLSTM, we compute the output with a gated combination of the inputs and u^i_t:

    o^i_t = LN\left( \sigma(v^i_t) \odot \hat{M}^{i-1}_t + \sigma(h^i_t) \odot M^i_t + \sigma(c^i_t) \odot u^i_t \right)    (13)

where v^i_t is the vertical gate that controls the input from the previous slot, h^i_t is the horizontal gate that controls the input from the previous time step, c^i_t is the cell gate that controls u^i_t, o^i_t is the output of the cell function, and LN(·) shares its parameters with the one used in Eqn. 1.

3.3 Relations to ON-LSTM and Shift-reduce Parser

Ordered Memory is implemented following the principles introduced in Ordered Neurons (Shen et al., 2018). Our model is related to ON-LSTM in several aspects: 1) the memory slots are similar to the chunks in ON-LSTM: when a higher-ranking memory slot is forgotten/updated, all lower-ranking memory slots should likewise be forgotten/updated; 2) where ON-LSTM uses the monotonically non-decreasing master forget gate to preserve long-term information while erasing short-term information, the OM model uses the cumulative probability \vec{\pi}_t; 3) similarly, the master input gate used by ON-LSTM to control the writing of new information into the memory is replaced with the reversed cumulative probability \overleftarrow{\pi}_t in the OM model.

At the same time, the internal mechanism of OM can be seen as a continuous version of a shift-reduce parser. At time step t, a shift-reduce parser could perform zero or several reduce steps to combine the heads of the stack, then shift word t onto the stack. The OM model implements the reduce step with the Gated Recursive Cell. 
Table 1: Test accuracy of the models, trained on operation lengths of ≤ 6, with their out-of-distribution results shown here (lengths 7-12). We ran 5 different runs of our models, giving the error bounds shown. The F1 score is the parsing score with respect to the ground-truth tree structure. The TreeCell is a recursive neural network based on the Gated Recursive Cell function proposed in section 3.2. For the Transformer and Universal Transformer, we follow the entailment architecture introduced in Radford et al. (2018): the model takes sentence1 and sentence2 as input, then uses the vector representation for the position at the last layer for classification. *The results for RRNet were taken from Jacob et al. (2018).

                                          Number of Operations                      Sys. Gen.
    Model                       7       8       9       10      11      12       A    B    C
    Sequential sentence representation
    LSTM                        88      84      80      78      71      69       84   60   59
    RRNet*                      84      81      78      74      72      71       –    –    –
    ON-LSTM                     91      87      85      81      78      75       70   63   60
    Inter-sentence attention
    Transformer                 51      52      51      51      51      48       53   51   51
    Universal Transformer       51      52      51      51      51      48       53   51   51
    Our model
    Accuracy                    98±0.0  97±0.4  96±0.5  94±0.8  93±0.5  92±1.1   94   91   81
    Parsing F1                  84.3±14.4
    Ablation tests
    cell(·) → TreeRNN Op.       69      67      65      61      57      53       –    –    –
    Recursive NN + ground-truth structure
    TreeLSTM                    94      92      92      88      87      86       91   84   76
    TreeCell                    98      96      96      95      93      92       95   95   90
    TreeRNN                     98      98      97      96      95      96       94   92   86

It combines \hat{M}^{i−1}_t, the output of the previous reduce step, and M^i_t, the next element in the stack, into \hat{M}^i_t, the representation for the new sub-tree. The number of reduce steps is modeled with the attention mechanism. 
The probability distribution p_t models the position of the head of the stack after all necessary reduce operations are performed. The shift operation is implemented as copying the current input word x_t into memory.

The upshot of drawing connections between our model and the shift-reduce parser is interpretability: we can approximately infer the computation graph constructed by our model with Algorithm 2 (see appendix). The algorithm can be used for the latent tree induction tasks in Williams et al. (2018).

4 Experiments

We evaluate the tree learning capabilities of our model on two datasets: logical inference (Bowman et al., 2015) and ListOps (Nangia and Bowman, 2018). In these experiments, we infer the trees with our model using Alg. 2 and compare them with the ground-truth trees used to generate the data. We evaluate parsing performance using the F1 score³. We also evaluate our model on the Stanford Sentiment Treebank (SST), which is the sentiment classification task described in Socher et al. (2013).

4.1 Logical Inference

The logical inference task described in Bowman et al. (2015) has a vocabulary of six words and three logical operations: or, and, not. The task is to classify the relationship between two logical clauses into seven mutually exclusive categories. We use a multi-layer perceptron (MLP) with (h_1, h_2, h_1 ∘ h_2, |h_1 − h_2|) as input, where h_1 and h_2 are the \hat{M}^N_T of their respective input sequences. We compare our model with LSTM, RRNet (Jacob et al., 2018), ON-LSTM (Shen et al., 2018), Transformer (Vaswani et al., 2017), Universal Transformer (Dehghani et al., 2018), TreeLSTM (Tai et al., 2015), TreeRNN (Bowman et al., 2015), and TreeCell. We used the same hidden state size for our model and the baselines for proper comparison. Hyper-parameters can be found in Appendix B. 
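For concreteness, the classifier input described above can be assembled as follows (a minimal sketch; plain Python lists stand in for the \hat{M}^N_T sentence vectors, and the values are toy numbers):

```python
# Sketch of building the MLP input (h1, h2, h1 * h2, |h1 - h2|) for the
# logical inference task from the two sentence representations.
def entailment_features(h1, h2):
    hadamard = [a * b for a, b in zip(h1, h2)]        # element-wise product h1 * h2
    abs_diff = [abs(a - b) for a, b in zip(h1, h2)]   # element-wise |h1 - h2|
    return h1 + h2 + hadamard + abs_diff              # concatenation, length 4D

feats = entailment_features([1.0, -2.0], [3.0, 0.5])
# feats == [1.0, -2.0, 3.0, 0.5, 3.0, -1.0, 2.0, 2.5]
```

The concatenated feature vector (length 4D for D-dimensional sentence vectors) is then fed to the MLP classifier.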
³All parsing scores are given by Evalb: https://nlp.cs.nyu.edu/evalb/

Table 2: Partitions of the Logical Inference task from Bowman et al. (2014). Each partition includes a training set from which all data points matching the rule indicated in Excluded have been filtered out, and a test set formed by the matched data points.

    Part.  Excluded                    Training set size   Test set example
    A      * ( and (not a) ) *         128,969             f (and (not a))
    B      * ( and (not *) ) *         87,948              c (and (not (a (or b))))
    C      * ( {and,or} (not *) ) *    51,896              a (or (e (and c)))
    Full   –                           135,529             –

Figure 2: Variations in induced parse trees under different runs of the logical inference experiment, each spanning the sequence "a or not d and not not b and c". The leftmost tree is the ground truth and also one of the induced structures. We have removed the parentheses in the original sequence for this visualization. It is interesting to note that the different structures induced by our model are all valid computation graphs for producing the correct results.

The model is trained on sequences containing up to 6 operations and tested on sequences with a higher number of operations (7-12).

The Transformer models were implemented by modifying the code from the Annotated Transformer⁴. The number of Transformer layers is the same as the number of slots in Ordered Memory. Unfortunately, we were not able to successfully train a Transformer model on the task, resulting in a model that only learns the marginal over the labels. We also tried to use the Transformer as a sentence embedding model, but to no avail. Tran et al. (2018) achieve similar results, suggesting this could be a problem intrinsic to self-attention mechanisms on this task.

Length Generalization Tests  The TreeRNN model represents the best results achievable if the structure of the tree is known. 
The TreeCell experiment was performed as a control, to isolate the performance of the cell(·) composition function alone versus both using cell(·) and learning the composition with OM. The performance of our model degrades only marginally with an increasing number of operations in the test set, suggesting generalization to these longer sequences never seen during training.

Parsing results  There is variability in parsing performance over several runs under different random seeds, but the model's ability to generalize to longer sequences remains fairly constant. The model learns a slightly different method of composition for consecutive operations. Perhaps predictably, these are variations that do not affect the logical composition of the subtrees. The source of the different parsing results can be seen in Figure 2. The results suggest that these latent structures are still valid computation graphs for the task, in spite of the variations.

Systematic Generalization Tests  Inspired by Loula et al. (2018), we created three splits of the original logical inference dataset with increasing levels of difficulty. Each consecutive split removes a superset of the previously excluded clauses, creating a harder generalization task. Each model is then trained on the ablated training set, and tested on examples unseen in the training data. As a result, the different test splits have different numbers of data points. Table 2 contains the details of the individual partitions.

The results are shown in the right section of Table 1 under Sys. Gen. The columns labeled A, B, and C are the model's aggregated accuracies over the unseen operation lengths. As with the length generalization tests, the models with the known tree structure perform the best on unseen structures, while sequential models degrade quickly as the tests get harder. 
Our model greatly outperforms all the other sequential models, performing slightly below the results of TreeRNN and TreeCell on the different partitions.

⁴http://nlp.seas.harvard.edu/2018/04/03/attention.html

Figure 3: (a) shows the accuracy of different models on the ListOps dataset. All models have 128 dimensions. Results for models with * are taken from Nangia and Bowman (2018). (b) shows our model's accuracy on the ListOps task when varying the size of the training set.

    (a)
    Model                      Accuracy       F1
    Baselines
    LSTM*                      71.5±1.5       –
    RL-SPINN*                  60.7±2.6       71.1
    Gumbel Tree-LSTM*          57.6±2.9       57.3
    Transformer                57.4±0.4       –
    Universal Transformer      71.5±7.8       –
    Havrylov et al. (2019)     99.2±0.5       –
    Ours                       99.97±0.014    100
    Ablation tests
    cell(·) → TreeRNN Op.      63.1           –

Combined with the parsing results and our model's performance on these generalization tests, we believe this is strong evidence that the model has both (i) learned to exploit symmetries in the structure of the data by learning a good cell(·) function, and (ii) learned where and how to apply said function by operating its stack memory.

4.2 ListOps

Nangia and Bowman (2018) build a dataset with nested summary operations on lists of single-digit integers. The sequences comprise the operators MAX, MIN, MED, and SUM MOD. The output is also an integer in [0, 9]. As an example, the input [MAX 2 9 [MIN 4 7 ] 0 ] has the solution 9. As the task is formulated to be easily solved with a correct parsing strategy, it provides an excellent test-bed to diagnose models that perform tree induction. The authors binarize the structure by choosing the subtree corresponding to each list to be left-branching: the model would first take into account the operator, and then proceed to compute the summary statistic within the list. 
A right-branching parse would require the entire list to be maintained in the model's hidden state.

Our model achieves 99.9% accuracy and an F1 score of 100% on the model's induced parse trees (see Table 3a). This result is consistent across 3 different runs of the same experiment. In Nangia and Bowman (2018), the authors perform an experiment to verify the effect of training set size on the latent tree models. As the latent tree models (RL-SPINN and ST-Gumbel) need to parse the input successfully to perform well on the task, the better performance of the LSTM relative to those models indicates that the size of the dataset does little to improve their ability to learn to parse. Our model seems to be more data efficient, and solves the task even when training only on a subset of 90k examples (Fig. 3b).

4.3 Ablation studies

We replaced the cell(·) operator with the RNN operator found in TreeRNN, which is the best performing model that explicitly uses the structure of the logical clause. In this test, we find that the TreeRNN operator results in a large drop across the different tasks. The detailed results for the ablation tests on both the logical inference and ListOps tasks are found in Tables 1 and 3a.

4.4 Stanford Sentiment Treebank

The Stanford Sentiment Treebank is a classification task described in Socher et al. (2013). There are two settings: SST-2, which reduces the task to a positive or negative label for each sentence (the neutral sentiment sentences are ignored), and SST-5, which is a fine-grained classification task with 5 labels for each sentence.

Current state-of-the-art models use pretrained contextual embeddings (Radford et al., 2018; McCann et al., 2017; Peters et al., 2018). Building on ELMo Peters et al. 
(2018), we achieve a performance comparable with the current state of the art in both the SST-2 and SST-5 settings. However, it should be noted that our model is a sentence representation model. Table 3 lists our and related work's respective performance on the SST task in both settings.

Table 3: Accuracy results of models on the SST.

    Model                                                SST-2       SST-5
    Sequential sentence representation & other methods
    Radford et al. (2017)                                91.8        52.9
    Peters et al. (2018)                                 –           54.7
    Brahma (2018)                                        91.2        56.2
    Devlin et al. (2018)                                 94.9        –
    Liu et al. (2019)                                    95.6        –
    Recursive NN + ground-truth structure
    Tai et al. (2015)                                    88.0        51.0
    Munkhdalai and Yu (2017)                             89.3        53.1
    Looks et al. (2017)                                  89.4        52.3
    Recursive NN + latent / learned structure
    Choi et al. (2018)                                   90.7        53.7
    Havrylov et al. (2019)                               90.2±0.2    51.5±0.4
    Ours (GloVe)                                         90.4        52.2
    Ours (ELMo)                                          92.0        55.2

5 Conclusion

In this paper, we introduce the Ordered Memory architecture. The model is conceptually close to previous stack-augmented RNNs, but with two important differences: 1) we replace the pop and push operations with a new writing and erasing mechanism inspired by Ordered Neurons (Shen et al., 2018); 2) we introduce a new Gated Recursive Cell to compose lower-level representations into higher-level ones. On the logical inference and ListOps tasks, we show that the model learns the proper tree structures required to solve them. As a result, the model can effectively generalize to longer sequences and to combinations of operators unseen in the training set, and the model is data efficient. We also demonstrate that our results on the SST are comparable with state-of-the-art models.

References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 
Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic generalization: What is required and can it be learned? arXiv preprint arXiv:1811.12889, 2018.

Samuel R Bowman, Christopher Potts, and Christopher D Manning. Recursive neural networks can learn logical semantics. arXiv preprint arXiv:1406.1827, 2014.

Samuel R Bowman, Christopher D Manning, and Christopher Potts. Tree-structured composition in neural networks without tree-structured architectures. arXiv preprint arXiv:1506.04834, 2015.

Samuel R Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021, 2016.

Siddhartha Brahma. Improved sentence modeling using suffix bidirectional LSTM. 2018.

Ciprian Chelba and Frederick Jelinek. Structured language modeling. Computer Speech & Language, 14(4):283–332, 2000.

Stanley F Chen. Bayesian grammar induction for language modeling. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 228–235. Association for Computational Linguistics, 1995.

Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Learning to compose task-specific tree structures. In Proceedings of the 2018 Association for the Advancement of Artificial Intelligence (AAAI), 2018.

Shay B Cohen, Dipanjan Das, and Noah A Smith. Unsupervised structure prediction with non-parallel multilingual guidance. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 50–61. Association for Computational Linguistics, 2011.

Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of The Fourteenth Annual Conference of the Cognitive Science Society, page 14. Indiana University, 1992.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

David Dowty. Direct compositionality. 14:23–101, 2007.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, 2016.

Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

Jerry A Fodor and Zenon W Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.

Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, 36(4):193–202, 1980.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1828–1836, 2015.

Caglar Gulcehre, Sarath Chandar, and Yoshua Bengio. Memory augmented neural networks with wormhole connections. arXiv preprint arXiv:1701.08718, 2017.

Serhii Havrylov, Germán Kruszewski, and Armand Joulin. Cooperative learning of disjoint syntax and semantics. In Proc. of NAACL-HLT, 2019.

Athul Paul Jacob, Zhouhan Lin, Alessandro Sordoni, and Yoshua Bengio. Learning hierarchical structures on-the-fly with a recurrent-recursive model for sequences. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 154–158, 2018.

Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016.

Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198, 2015.

Donald E Knuth. On the translation of languages from left to right. Information and Control, 8(6):607–639, 1965.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, 2018.

Brenden M Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350, 2017.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. CoRR, abs/1901.11504, 2019. URL http://arxiv.org/abs/1901.11504.

Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181, 2017.

Joao Loula, Marco Baroni, and Brenden M Lake. Rearranging the familiar: Testing compositional generalization in recurrent networks. arXiv preprint arXiv:1807.07545, 2018.

Jean Maillard, Stephen Clark, and Dani Yogatama. Jointly learning sentence embeddings and syntax with unsupervised tree-LSTMs. arXiv preprint arXiv:1705.09189, 2017.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.

Richard Montague. Universal grammar. Theoria, 36(3):373–398, 1970.

Michael C Mozer and Sreerupa Das. A connectionist symbol manipulator that discovers the structure of context-free languages. In Advances in Neural Information Processing Systems, pages 863–870, 1993.

Tsendsuren Munkhdalai and Hong Yu. Neural tree indexers for text understanding. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 1, page 11. NIH Public Access, 2017.

Nikita Nangia and Samuel R Bowman. ListOps: A diagnostic dataset for latent tree learning.
arXiv preprint arXiv:1804.06028, 2018.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Jordan B Pollack. Recursive distributed representations. Artificial Intelligence, 46(1-2):77–105, 1990.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf, 2018.

Brian Roark. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249–276, 2001.

Marcel Paul Schützenberger. On context-free languages and push-down automata. Information and Control, 6(3):246–264, 1963.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. Ordered neurons: Integrating tree structures into recurrent neural networks. arXiv preprint arXiv:1810.09536, 2018.

Stuart M Shieber. Sentence disambiguation by a shift-reduce parsing technique. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pages 113–118. Association for Computational Linguistics, 1983.

Richard Socher, Christopher D Manning, and Andrew Y Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, volume 2010, pages 1–9, 2010.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank.
In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

Shawn Tan and Khe Chai Sim. Towards implicit complexity control using variable-depth deep neural networks for automatic speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5965–5969. IEEE, 2016.

Ke Tran, Arianna Bisazza, and Christof Monz. The importance of being recurrent for modeling hierarchical structure. arXiv preprint arXiv:1803.03585, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Adina Williams, Andrew Drozdov*, and Samuel R Bowman. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association of Computational Linguistics, 6:253–267, 2018.

Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. Learning to compose words into sentences with reinforcement learning. arXiv preprint arXiv:1611.09100, 2016.

Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. Memory architectures in recurrent neural network language models. 2018.

Zheng Zeng, Rodney M Goodman, and Padhraic Smyth. Discrete recurrent neural networks for grammatical inference.
IEEE Transactions on Neural Networks, 5(2):320–330, 1994.