{"title": "Novel positional encodings to enable tree-based transformers", "book": "Advances in Neural Information Processing Systems", "page_first": 12081, "page_last": 12091, "abstract": "Neural models optimized for tree-based problems are of great value in tasks like SQL query extraction and program synthesis.\nOn sequence-structured data, transformers have been shown to learn relationships across arbitrary pairs of positions more reliably than recurrent models.\nMotivated by this property, we propose a method to extend transformers to tree-structured data, enabling sequence-to-tree, tree-to-sequence, and tree-to-tree mappings.\nOur approach abstracts the transformer's sinusoidal positional encodings, allowing us to instead use a novel positional encoding scheme to represent node positions within trees.\nWe evaluated our model in tree-to-tree program translation and sequence-to-tree semantic parsing settings, achieving superior performance over both sequence-to-sequence transformers and state-of-the-art tree-based LSTMs on several datasets.\nIn particular, our results include a 22% absolute increase in accuracy on a JavaScript to CoffeeScript translation dataset.", "full_text": "Novel positional encodings to enable\n\ntree-based transformers\n\nVighnesh Leonardo Shiv\n\nMicrosoft Research\n\nRedmond, WA\n\nChris Quirk\n\nMicrosoft Research\n\nRedmond, WA\n\nvishiv@microsoft.com\n\nchrisq@microsoft.com\n\nAbstract\n\nNeural models optimized for tree-based problems are of great value in tasks like\nSQL query extraction and program synthesis. On sequence-structured data, trans-\nformers have been shown to learn relationships across arbitrary pairs of positions\nmore reliably than recurrent models. Motivated by this property, we propose\na method to extend transformers to tree-structured data, enabling sequence-to-\ntree, tree-to-sequence, and tree-to-tree mappings. 
Our approach abstracts the\ntransformer\u2019s sinusoidal positional encodings, allowing us to instead use a novel\npositional encoding scheme to represent node positions within trees. We evalu-\nated our model in tree-to-tree program translation and sequence-to-tree semantic\nparsing settings, achieving superior performance over both sequence-to-sequence\ntransformers and state-of-the-art tree-based LSTMs on several datasets. In partic-\nular, our results include a 22% absolute increase in accuracy on a JavaScript to\nCoffeeScript translation dataset.\n\n1\n\nIntroduction\n\n1.1 Sequence modeling\n\nNeural networks have been successfully applied to an increasing range of tasks, including speech\nrecognition and machine translation. These domains crucially depend on techniques for modeling\nstreams of audio and text, represented as dynamically sized sequences of tokens. Researchers have\nhistorically handled such data primarily with recurrent techniques, which encode sequences into\n\ufb01xed-length representations. The sequence-to-sequence LSTM model (Sutskever et al., 2014) is a\nparticularly notable example in recent times.\nRecurrent architectures have some disadvantages. From a generalization perspective, recurrent\ncells face the challenge of learning relationships between tokens many time steps apart. Attention\nmechanisms are now commonly employed to mitigate this issue, driving new state-of-the-art results in\ndif\ufb01cult tasks such as machine translation (Wu et al., 2016). From an ef\ufb01ciency standpoint, recurrence\ndoes not lend itself to parallelism, often rendering recurrent models expensive to train. 
Recurrent models are also difficult to interpret, employing an obtuse series of neural layers between time steps that renders relationships modeled within the data unclear.
The transformer (Vaswani et al., 2017) is a stateless sequence-to-sequence architecture motivated by these issues, constructed by forgoing recurrence altogether in favor of extensive attention. This design allows information to flow over unbounded distances during training and inference, without the need for complex gates and gradient clipping. This type of long-distance flow, driven by learned attention transforms over positional encodings, provides a powerful computational mechanism. Transformers also lend themselves to easier interpretation, as their attention layers can at least reveal information about learned relationships between elements of a sequence.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.2 Hierarchical modeling

Recent work has begun to apply neural networks to programming tasks (Allamanis et al., 2018). In recent years, programming language analysis techniques have begun to exploit statistical techniques commonly used on large natural language corpora (Hindle et al., 2012). These can be used to identify idioms in software, search for code clones, search code by natural language, or even translate from one programming language to another.
Representing programs is an interesting challenge.
One option is to view them as a one-dimensional sequence of tokens and use techniques common in the natural language processing literature. However, these programs are intentionally endowed with hierarchical structure; using purely sequence-oriented methods may result in losing valuable structural information.

Expanding past sequential modeling, a common approach is to pass information through neighbors in the graph, in a manner that is reminiscent of message passing in graphical models (Li et al., 2016). To ensure that information can fully propagate across the graph, this message passing must be applied multiple times, bounded by the diameter of the graph. While this allows us to exploit hierarchical structure, ideally we would like to do so while capturing the efficient information flow and other benefits of transformer models.

In this work, we generalize transformers to embed tree representations. Our work introduces novel positional encodings for tree-structured data¹. Using these encodings, we can apply transformers to tree-structured domains, allowing information to percolate fully across the graph in a single layer. This can potentially extend the transformer to settings ranging from natural language parse trees to program abstract syntax trees. We evaluate our tree-transformers on programming language translation tasks such as translating JavaScript to CoffeeScript (Chen et al., 2018) as well as semantic parsing tasks including extracting a database query from a natural language request (Dahl et al., 1994), demonstrating improved performance over sequential transformers.

2 Positional encodings in attention models

The order of a sequence is rich in information; order-agnostic (bag-of-words) models are limited in power by their inability to use this information.
One particularly common way to capture order is through recurrence; recurrent models inherently consider the order of an input sequence by processing its elements sequentially. As transformers forgo recurrence, they require information about the input sequence's order in some other form. This additional information is provided in the form of positional encodings. Each position in the input sequence is associated with a vector, which is added to the embedding of the token at that position. This allows the transformer to learn positional relationships, as well as relationships between the token embedding and positional encoding spaces.

2.1 Properties

The transformer's original positional encoding scheme has two key properties. First, every position has a unique positional encoding, allowing the model to attend to any given absolute position. Second, any relationship between two positions can be modeled by an affine transform between their positional encodings. The positional encodings take the form

PE_{pos,2i} = sin(pos / f(i))
PE_{pos,2i+1} = cos(pos / f(i))

where f(i) = 10000^{2i/d_model} for i ∈ [0, d_model/2). Considering the identities

cos(α + β) = cos(α)cos(β) − sin(α)sin(β)
sin(α + β) = sin(α)cos(β) + cos(α)sin(β)

we can see that the transformer can attend to relative offsets using linear transforms.

¹ Implemented in Microsoft ICECAPS: https://github.com/microsoft/icecaps
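This relative-offset property can be checked numerically. The short sketch below (our illustration, not code from the paper) computes the sinusoidal encodings exactly as defined above and verifies that the encoding of position x + y is a linear combination of PE_x's components, with coefficients drawn only from PE_y:

```python
import math

def sinusoidal_encoding(pos, d_model=64):
    # PE[2i] = sin(pos / f(i)), PE[2i+1] = cos(pos / f(i)),
    # with f(i) = 10000 ** (2i / d_model), as defined above.
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        f = 10000.0 ** (2 * i / d_model)
        pe[2 * i] = math.sin(pos / f)
        pe[2 * i + 1] = math.cos(pos / f)
    return pe

# The coefficients of the linear combination depend only on the
# offset y, so "position x + y" is a fixed linear transform of PE_x.
d = 64
pe_x, pe_y, pe_xy = (sinusoidal_encoding(p, d) for p in (7, 5, 12))
for i in range(d // 2):
    s = pe_x[2*i] * pe_y[2*i + 1] + pe_x[2*i + 1] * pe_y[2*i]      # sin(a + b)
    c = pe_x[2*i + 1] * pe_y[2*i + 1] - pe_x[2*i] * pe_y[2*i]      # cos(a + b)
    assert abs(s - pe_xy[2*i]) < 1e-9
    assert abs(c - pe_xy[2*i + 1]) < 1e-9
```

The loop body is exactly the angle-addition identities applied coordinate-wise, which is the symbolic derivation given next.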
For instance, the encoding of position x + y can be phrased as a linear combination of x and y's positional encodings:

PE_{x+y,2i} = sin((x + y)/f(i)) = sin(x/f(i) + y/f(i))
            = sin(x/f(i)) cos(y/f(i)) + cos(x/f(i)) sin(y/f(i))
            = PE_{x,2i} PE_{y,2i+1} + PE_{x,2i+1} PE_{y,2i}

PE_{x+y,2i+1} = cos((x + y)/f(i)) = cos(x/f(i) + y/f(i))
            = cos(x/f(i)) cos(y/f(i)) − sin(x/f(i)) sin(y/f(i))
            = PE_{x,2i+1} PE_{y,2i+1} − PE_{x,2i} PE_{y,2i}

2.2 Bag interpretation

Positional encodings address the power limitations of bag-of-words representations by upgrading the bag of words to a bag of annotated words. Indeed, the transformer's core attention mechanism is order-agnostic, treating keys as a bag. The calculations performed on any given element of a sequence are entirely independent of the order of the rest of that sequence in that layer; this leaves most of the work of exploiting positional information to the positional encodings (Vaswani et al., 2017), though decoder-side self-attention masking and autoregression also play a role.

Now, a bag of words annotated with positions can be equivalently thought of as a bag of positions annotated with words. From this perspective, we see that it is not at all necessary that our input "sequence" of positions have any direct correspondence with the sequence of associated "indices," i.e. an evenly distributed number line. While the original transformer's positional encodings do form this correspondence for the purposes of sequence modeling, we can consider alternative positional encodings to represent non-sequential structures in a positional space. We use this idea to extend the transformer to tree-structured data, representing structural relationships between elements as relationships between points in positional space.

3 Tree positional encodings

Now we construct our positional encoding scheme for trees.
We focus on directed trees with ordered lists of children. Each node has a unique parent (besides the root node) and a numbered finite list of children. Each node's position can be defined as its path from the root node, and paths between nodes can climb up through parent relationships or down through child relationships.

3.1 Properties

In building tree positional embeddings, we aim to preserve the properties described in Section 2.1 with our new scheme. While the uniqueness property needs no adjustment, the positional relationship property needs to be modified to suit trees rather than sequences. In the context of sequences, the relationship between two positions is simply the distance that separates them. For trees, though, the relation between two nodes is a path: a series of steps along tree branches, with each step either going up to the parent or down to a child.

Therefore, our desired property is that for all paths φ, there is a corresponding affine transform A_φ in the positional space that captures the same relationship. Specifically, if a and b are two positions in a tree such that the path between them is φ, then we desire the following:

PE_b = A_φ PE_a

This allows the transformer to learn path-wise relationships within its embedding layers.
From a given node in an n-ary tree, there are (n + 1) potential length-1 paths: a step down to any of its n children, and a step up to the parent. Any longer path φ can be built as a composition of these length-1 paths.
In the positional space, we will associate the step down to child positions 1, . . . , n with the affine operators D_1, . . . , D_n, and the step up to the parent with the affine operator U. For any path φ, we can construct the corresponding transform A_φ as a composition of D's and U's.
For example, if we wish to denote the positional encoding of node x's grandparent's first child (e.g., the path φ = ⟨PARENT, PARENT, CHILD-1⟩), we can write:

PE_{CHILD-1(PARENT(PARENT(x)))} = PE_{φ(x)} = A_φ PE_x = D_1 U² PE_x

Figure 1: Example computations of positional encodings for nodes in a regular tree. The sequence of branch choices b determines a sequence of transforms D_{b_1}, D_{b_2}, . . . to apply to the root node's positional encoding. U is defined as the complement: applying it to any node results in that node's parent (e.g. r = Ux = U²y = U³z). The transforms D_i, U are defined in Equations 2 and 3.

Figure 2: Nearest neighbor heatmap of the parameter-free tree encoding scheme. We number the nodes in the tree according to a breadth-first left-to-right traversal of a balanced binary tree: position 0 is the root, 1 is the first child of root, 2 is the second child of root, 3 is the first child of the first child of root, and so on. In each case, we consider the row position as a "query" and each column position as a potential "value". The attention score of solely the positional encoding after softmax is represented as a heatmap scaling from black (0.0) through red and yellow to white (1.0).

As every path can be broken down into a composition of these (n + 1) operators, we need only focus on these basic operators' relationships. The fundamental relationship between these operators is that traveling up a branch negates traveling down any branch. Our constraint then is:

U D_i = I    ∀i ∈ {1, . . . , n}    (1)

3.2 Proposed encoding scheme

We propose a stack-like encoding for tree positions. This scheme adheres to the above constraints for all trees up to a specified depth, and still works well in practice for even deeper trees.
We will start by describing a parameter-free version of our positional encoding scheme for simplicity. Our scheme takes two hyperparameters: n, the degree of our tree, assumed to be regular; and k, the maximum tree depth for which our constraint is preserved. Each positional encoding has dimension n·k, and each operator U, D_1, . . . , D_n preserves this dimensionality. The root position is encoded as the zero vector, and every other node position is encoded according to its path from the root. As paths from the root consist only of steps downward, we can denote this path as ⟨b_1, . . . , b_L⟩, where b_i is the step choice at the ith layer and L is the layer at which the node resides. Then, for any node x, we compute its positional encoding as demonstrated in Figure 1:

x = D_{b_L} D_{b_{L−1}} · · · D_{b_1} 0

Now we define D_i and U. The intuition behind our positional encoding scheme is to treat the positional encodings as a stack of length-1 component paths. Every D_i operation pushes a length-1 path onto the stack, while U pops an element. The stack can contain at most k component paths. To the extent that our assumption L ≤ k holds, these properties enforce Equation 1.

Figure 3: Common traversals and mixtures thereof can be represented as linear transforms. Using the position encoding described in this paper, finding the parent, left child, or right child of a given node can be represented as linear transforms U, D_1, and D_2. Complex traversals can also be represented as linear transforms by composing these operations. The attention heatmaps below demonstrate the similarity of tree positional encodings applied to different points in the tree when the "query" has been transformed before the dot product with the value. (a) Parent: U. (b) Siblings: (D_1 U + D_2 U)/2. (c) Aunts: (D_1 + D_2)U²/2. (d) Cousins: (D_1 + D_2)²U²/4.
In more explicit terms, for a given node x, we compute D_i x by concatenating a one-hot n-vector with hot bit i (e_i^n) to the left side of x and truncating x on the right to preserve dimensionality. We define U complementarily. In other words, for a given node x,

D_i x = e_i^n ; x[:−n]    (2)
U x = x[n:] ; 0_n    (3)

where ; represents concatenation, and [n:] and [:−n] represent truncation by n elements on the left and right side, respectively (as per Python notation). Figure 2 depicts visually how this parameter-free positional encoding scheme distinguishes between different nodes, and Figure 3 demonstrates how this scheme can efficiently represent specific structural relationships.

While these D, U satisfy our constraint whenever L ≤ k, it should be noted that for L > k, U D_i is not necessarily the identity. Traveling down more than k layers will cause this scheme to "forget" nodes more than k layers up, which cannot be inverted. In practice, we make the simplifying assumption that this loss of information is insignificant for sufficiently large k.

The positional encoding scheme as proposed so far approximately fulfills both the uniqueness and linear composition properties. This scheme is parameter-free; however, we find that adding a parametrizable component helps diversify our encodings, improving their inductive bias. Our encoding consists of a sequence of one-hot chunks, each representing a different layer of the tree. One will note that we can weight these one-hot chunks with any geometric series without disrupting the affine property:

x′ = x ⊙ (1_n; p_n; p²_n; . . .)

x′ here satisfies the same properties as x. Here, p is a parameter and p_n is an n-vector of p's. Figure 4 demonstrates how different values for p can radically alter attention biases.
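The push/pop behavior of Equations 2 and 3 can be sketched in a few lines. The list-based code below is our own illustration (the names D and U and the toy sizes n = 2, k = 4 are ours); a real model would implement the same operations as vectorized tensor transforms:

```python
def D(i, x, n=2):
    # Eq. 2 (push): prepend the one-hot n-vector e_i^n (children are
    # numbered 1..n) and truncate n entries on the right.
    e = [0.0] * n
    e[i - 1] = 1.0
    return e + x[:-n]

def U(x, n=2):
    # Eq. 3 (pop): drop the leftmost n entries and zero-pad on the right.
    return x[n:] + [0.0] * n

n, k = 2, 4                      # binary tree; constraint holds up to depth k
root = [0.0] * (n * k)           # the root is the zero vector
x = D(1, D(2, root))             # root -> second child -> that node's first child
assert x == [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
assert U(D(1, x)) == x           # U D_i = I while the depth stays <= k
```

Descending more than k levels silently drops the oldest stack entry on the right, which is exactly the "forgetting" beyond depth k discussed above.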
Analogous to the original transformer's combination of sinusoidal encodings, we propose concatenating multiple tree encodings, each equipped with its own p to be learned. To prevent the encodings' norms from exploding, we apply tanh to p to bound it between -1 and 1, and multiply the encodings by a factor of √(1 − p²) to approximately normalize them. We then scale them further by a factor of √(d_model / 2) to achieve norms more similar to the original transformer's positional encoding scheme.

Figure 4: Nearest neighbor heatmaps of parameterized tree encodings with example values of p. (a) Decay factor p = 0.9. (b) Decay factor p = 0.7. As shown in Figure 2, many of the lower-level positions in the tree are quite similar in the absence of a decay factor. For example, position 5 (Root, D2, D1) is most similar to itself (score of 0.44), but quite similar to position 6 (Root, D2, D2) and position 3 (Root, D1, D1), with scores of 0.16. An appropriate level of decay allows each position to be uniquely identified, as in (a); too much decay provides little additional information, as in (b).

4 Decoder

To accommodate a new positional encoding scheme, we need to slightly modify the decoder. The original transformer's decoder concatenates a start token to the beginning of the sequence without modifying the positional encodings. This results in misalignment between autoregressed outputs and positional encodings, e.g. the encoding for the second position is summed with the embedding of the first output. This is not an issue in the sequential case; the positional encodings are self-similar, so this "misalignment" is a linear transform away from the "correct" alignment.
However, no traversal through a tree's nodes has this self-similarity property, so proper alignment here is critical.
We use a zero vector for the start token's positional encoding, and use the appropriate positional encoding for each autoregressed output. Our decoder must dynamically compute the new positional encoding whenever it produces a token. The decoder must keep track of the partial tree structure that it constructs, to correctly traverse to the next position based on history. In order to build this partial tree structure, the decoder must be aware of how many children each node must have. To this end, we construct our vocabularies such that each symbol is annotated with a number of children. When symbols have a varying number of children, they are added multiple times to the vocabulary, each with a different annotation. Given this information, the decoder can flexibly construct trees using any tree traversal algorithm, as long as it is applied consistently. In our experiments, we explored both depth-first and breadth-first traversals for decoding.

5 Experiments and results

For our evaluation, we consider both tree-to-tree and sequence-to-tree tasks. Both categories test our model's ability to decode tree-structured data; the sequence-to-tree task additionally tests our model's ability to translate between different positional encoding schemes. Our tree-to-tree evaluation centers around program translation tasks, while our sequence-to-tree evaluation focuses on semantic parsing.
As our model expects regular trees, we preprocess all tree data by converting trees to left-child-right-sibling representations, binarizing them². This enforces n = 2 for our model. We use a maximum tree depth k = 32 for all experiments.
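As an illustration of this preprocessing step, a left-child-right-sibling binarization might look like the following sketch (the tuple representation and the function name `to_lcrs` are ours, for illustration; we follow the transformation described by Chen et al. (2018)):

```python
def to_lcrs(node):
    # node is (label, [children]); returns (label, left, right), where
    # left is the binarized first child and right the binarized next sibling.
    def convert(siblings):
        if not siblings:
            return None
        label, children = siblings[0]
        return (label, convert(children), convert(siblings[1:]))
    return convert([node])

# A "block" node with three children becomes a chain of right-sibling
# links, so the result is binary regardless of the original arity.
tree = ("block", [("a", []), ("b", []), ("c", [])])
assert to_lcrs(tree) == ("block", ("a", None, ("b", None, ("c", None, None))), None)
```

Note that a long statement block becomes a deep right-leaning chain, which is one reason a generous maximum depth k is needed in practice.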
Unless listed otherwise, we performed all of our experiments with Adam (Kingma & Ba, 2015), a batch size of 128, a dropout rate of 0.1 (Srivastava et al., 2014), and gradient clipping for norms above 10.0.

5.1 Tree-to-tree: program translation

For tree-to-tree evaluation, we focused on three sets of program translation tasks from the literature to test our model against. The first set of tasks is For2Lam, a synthetic translation dataset between an invented imperative and functional language. The dataset is split into two tasks: one for small programs and one for large programs. The second set of tasks involves translating between generated CoffeeScript and JavaScript code. The data is similarly split, here both by program length and vocabulary.

² Although binarizing trees may not always be necessary, both programming language and natural language trees often have constructs with unbounded numbers of children (e.g. statement blocks). For years, natural language parsing efforts have converted n-ary grammars into binary forms to enable efficient algorithms and estimation (Klein & Manning, 2003). We explored left-child-right-sibling representations primarily to be consistent with past work (Chen et al., 2018); it would be interesting to measure the impact of alternate binarization strategies (or omitting binarization altogether) when using tree transformers.

Figure 5: Whole program error rates for synthetic tasks with comparisons to tree2tree LSTMs. The tree-transformer demonstrates advantages over both the sequence-transformer and tree2tree LSTM, which become particularly clear for the long-sequence dataset. This indicates that the custom positional encodings may be providing useful structural information. (a) Synthetic short-sequence results. (b) Synthetic long-sequence results.
Each set of tasks contains 100,000 training examples and 10,000 test examples total. More details about the data sets can be found in Chen et al. (2018). We report all results in terms of whole program accuracy.

5.1.1 Synthetic translation tasks

For the synthetic translation tasks, we trained two tree-transformers on parse tree representations, one using depth-first traversal and the other using breadth-first. We also trained a classic sequence-transformer on linearized parse trees. All models were trained with four layers and d_model = 256. The sequence-transformer was trained with d_ff = 1024 and a positional encoding dimension that matched d_model, in line with the hyperparameters used in the original transformer. The tree-transformer, however, was given a larger positional encoding size of 2048 in exchange for a smaller d_ff of 512. This was to emphasize the role of our tree positional encodings, which are inherently bulkier than the sequential positional encodings, while maintaining a similar parameter count.

The results for the synthetic tasks can be found in Figure 5, with comparisons to the state-of-the-art system (Chen et al., 2018). All methods compared get very close to solving the small program dataset. The results on long programs are of more interest: both tree-transformer models perform significantly better than the sequence-transformer, suggesting that the positional encodings help considerably for larger trees. The depth-first search variant outperforms breadth-first search in both cases. We conjecture that depth-first search may be a more favorable traversal method in general; it tends to construct more subtrees similar to each other earlier in the process.
The depth-first variant also outperformed the tree2tree LSTM on both synthetic datasets, suggesting that the transformer's attention-based approach may be of value to programming language translation just as it is to natural language translation.

5.1.2 CoffeeScript-JavaScript translation

Given the results on the synthetic tasks, we focused on training depth-first traversal tree-transformers for this task. The data is partitioned four ways, into two sets of vocabulary ('A' and 'B') and two categories of program length (short and long). We use the same hyperparameters as in the synthetic tasks, and once again compare our results with the tree2tree LSTM model. For memory-related reasons, a batch size of 64 was used instead for the tasks with longer program lengths.

The results for CoffeeScript-JavaScript translation can be found in Figure 6. The tree-transformer obtains state-of-the-art results on over half the datasets, while still producing competitive results on the other datasets. These results demonstrate that the advantages of the tree-transformer's design are more prominent with large data. While its performance tends to be slightly weaker on the simpler short-sequence tasks, the tree-transformer gains up to 20 percentage point improvements over the

Figure 6: Whole program error rate data for CoffeeScript-JavaScript translation tasks. Here, the tree-transformer is compared to Chen et al.'s tree-to-tree model (Chen et al., 2018), which has previously produced state-of-the-art results. The tree-transformer improved results on over half the datasets, demonstrating the largest gains on the most difficult datasets.

Table 1: Metrics for semantic parsing tasks.
The sequence-to-tree transformer achieves state-of-the-art results on the largest dataset studied here, and outperforms the baseline transformer by several points on two of the three datasets. This suggests that the induced bias of explicit tree structure outweighs the additional hurdle of converting between positional encoding schemes. Both transformers saw less success on the two smaller datasets relative to the literature, perhaps indicating a tendency to overfit.

Dataset   Seq2Tree Tform   Seq2Seq Tform   Literature
JOBS      84.3             85.0            90.7 (Liang et al., 2011)
GEO       84.6             81.1            89.0 (Kwiatkowski et al., 2013)
ATIS      86.4             84.4            84.6 (Dong & Lapata, 2016)

tree2tree LSTM on the most difficult tasks here. Overall, these results are promising for applying tree-transformers to larger-scale tree-to-tree scenarios.

5.2 Sequence-to-tree: semantic parsing

For sequence-to-tree evaluation, we focused on several benchmark semantic parsing tasks. In each task, the model must transform natural language queries into tree-structured code snippets against a particular query language or API. The three datasets we consider are:

• JOBS (Califf & Mooney, 1999), a job listing database retrieval task. This dataset consists of 500 training examples and 140 evaluation examples of Prolog-style query extraction.

• GEO (Tang & Mooney, 2001), a geographical database retrieval task. This dataset consists of 680 training examples and 200 evaluation examples of lambda-calculus based semantic parses.

• ATIS (Dahl et al., 1994), a flight booking task. The most informative results are for ATIS, where there is ample training data, featuring 4480 training examples and 450 evaluation examples.
Each data point pairs a sentence with an argument-identified lambda-calculus expression.

JOBS and GEO provide far less data, each with fewer than 1000 training examples, so their results are less reliable.
We provide whole program accuracy as our key metric to properly compare our model against the literature. For all datasets, we train four-layer sequence-to-tree and sequence-to-sequence transformers with d_model = 256, d_ff = 1024, and d_pos = 2048.

The results for our semantic parsing experiments can be found in Table 1. Here, we compare our metrics against the best in the literature as surveyed by Dong & Lapata (2016). We see that our model generally outperforms the classic transformer, leading by several percentage points on ATIS and GEO and performing only slightly worse on JOBS, the smallest evaluated dataset. It may be possible to improve our results on this smaller dataset through a cross-validated hyperparameter search, though we do not explore that here. Enforcing hierarchical structure upon the transformer appears to be worth the additional challenge of converting between positional encodings. The sequence-to-tree transformer outperforms state-of-the-art recurrent methods on ATIS, but suffers relatively on the smaller datasets. This perhaps indicates that the transformer-based approach is more prone to overfitting.

6 Related work

Although ours is the first effort in applying transformer models to hierarchically shaped data, there has been a range of prior work in tree-structured extensions of recurrent architectures. Soon after the recent resurgence of recurrent neural networks over linear sequences, researchers began to consider extensions of these models that accommodate structures more complex than linear chains.
Initial efforts focused on input tree structures, where the shape of the input tree is fixed in advance. Tree-LSTMs demonstrated benefits in tasks such as sentence similarity, sentiment analysis (Tai et al., 2015), and information extraction (Miwa & Bansal, 2016). With a few changes, these models can be extended to cover graph-like structures as well (Peng et al., 2017).

Sequence-to-sequence models without explicit tree modeling have been applied to tree generation using only a simple linearization of the tree structure (Vinyals et al., 2015; Eriguchi et al., 2017; Aharoni & Goldberg, 2017). Later work has proposed generation methods that are more sensitive to tree structures and well-formedness constraints (Dong & Lapata, 2016; Alvarez-Melis & Jaakkola, 2017), leading to new state-of-the-art results.

Rather than explicitly modeling hierarchically structured data, some recent work imposes hyperbolic geometry on the activations of neural networks (Gulcehre et al., 2018). Defining attention in terms of hyperbolic operations allows modeling of latent hierarchical structures. In contrast, our work focuses on the case of explicit hierarchical structure. Another recent method imposes implicit hierarchical structure over linear strings using ordered neurons (Shen et al., 2019). The ordering can be interpreted as a fuzzy hierarchy over recurrent memory cells, with commonly-forgotten neurons corresponding to deeper tree nodes and long-retained neurons corresponding to nodes closer to the root. A hierarchy over words can be reconstructed given the activations of the model's master forget gate. This method is appropriate for imposing an inductive bias, but is less suited for scenarios that require strict enforcement of tree structures.

A particularly relevant transformer variant explicitly captures relative position, rather than relying on sinusoidal models to indirectly model distances (Shaw et al., 2018).
However, this clear precursor to modeling labeled, directed graphs is limited to relative linear positions.

7 Conclusion

We have proposed a novel scheme of custom positional encodings to extend transformers to tree-domain tasks. By leveraging the strengths of the transformer, we have achieved an efficiently parallelizable model that can consider relationships between arbitrary pairs of tree nodes in a single step. Our experiments have shown that our model can often outperform sequence transformers on tree-oriented tasks, and we intend to apply it to other tree-domain tasks of interest as future work.

By abstracting the transformer's positional encodings, we have established the potential for generalized transformers to consider other nonlinear structures, given proper implementations. We are interested in exploring alternative implementations for other domains, in particular graph-structured data as motivated by structured knowledge tasks.

Finally, in this paper we have only considered binary trees: in particular, binary tree representations of trees not originally structured as such. Arbitrary tree representations have their own advantages and complications; we would like to explore training on them directly.

References

Roee Aharoni and Yoav Goldberg. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 132–140, Vancouver, Canada, July 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/P17-2021.

Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. In International Conference on Learning Representations, 2018.

D. Alvarez-Melis and T. Jaakkola.
Tree structured decoding with doubly recurrent neural networks. In International Conference on Learning Representations (ICLR), 2017.

Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-match rules for information extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, AAAI '99/IAAI '99, pp. 328–334, Menlo Park, CA, USA, 1999. American Association for Artificial Intelligence. ISBN 0-262-51106-1. URL http://dl.acm.org/citation.cfm?id=315149.315318.

Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. In International Conference on Learning Representations, 2018.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. Expanding the scope of the ATIS task: the ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pp. 43–48, Plainsboro, New Jersey, 1994.

Li Dong and Mirella Lapata. Language to logical form with neural attention. In Association for Computational Linguistics, 2016.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 72–78, Vancouver, Canada, July 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/P17-2012.

Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, and Nando de Freitas. Hyperbolic attention networks, 2018.

Abram Hindle, Earl Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu.
On the naturalness of software. In International Conference on Software Engineering, 2012.

Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2015.

Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pp. 423–430, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1075096.1075150. URL https://doi.org/10.3115/1075096.1075150.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1545–1556. Association for Computational Linguistics, 2013. URL http://aclweb.org/anthology/D13-1161.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations, 2016.

Percy Liang, Michael I. Jordan, and Dan Klein. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pp. 590–599, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN 978-1-932432-87-9. URL http://dl.acm.org/citation.cfm?id=2002472.2002547.

Makoto Miwa and Mohit Bansal. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1105–1116, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1105.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih.
Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115, 2017. ISSN 2307-387X. URL https://transacl.org/ojs/index.php/tacl/article/view/1028.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-2074.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. Ordered neurons: Integrating tree structures into recurrent neural networks. In ICLR, May 2019. ICLR 2019 Best Paper Award.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Association for Computational Linguistics, 2015.

Lappoon R. Tang and Raymond J. Mooney. Using multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of the 12th European Conference on Machine Learning, EMCL '01, pp.
466\u2013477, London, UK, UK, 2001. Springer-Verlag. ISBN 3-540-42536-5.\nURL http://dl.acm.org/citation.cfm?id=645328.650015.\n\nAshish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,\n\u0141 ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg,\nS. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural\nInformation Processing Systems 30, pp. 5998\u20136008. Curran Associates, Inc., 2017. URL http:\n//papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.\n\nOriol Vinyals, \u0141 ukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar\nas a foreign language. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.),\nAdvances in Neural Information Processing Systems 28, pp. 2773\u20132781. Curran Associates, Inc.,\n2015. URL http://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.\npdf.\n\nYonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey,\nMaxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson,\nXiaobing Liu, \u0141ukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith\nStevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex\nRudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google\u2019s neural\nmachine translation system: Bridging the gap between human and machine translation, 2016.\n\n11\n\n\f", "award": [], "sourceid": 6499, "authors": [{"given_name": "Vighnesh", "family_name": "Shiv", "institution": "Microsoft Research"}, {"given_name": "Chris", "family_name": "Quirk", "institution": "Microsoft Research"}]}