{"title": "Densely Connected Attention Propagation for Reading Comprehension", "book": "Advances in Neural Information Processing Systems", "page_first": 4906, "page_last": 4917, "abstract": "We propose DecaProp (Densely Connected Attention Propagation), a new densely connected neural architecture for reading comprehension (RC). There are two distinct characteristics of our model. Firstly, our model densely connects all pairwise layers of the network, modeling relationships between passage and query across all hierarchical levels. Secondly, the dense connectors in our network are learned via attention instead of standard residual skip-connectors. To this end, we propose novel Bidirectional Attention Connectors (BAC) for efficiently forging connections throughout the network. We conduct extensive experiments on four challenging RC benchmarks. Our proposed approach achieves state-of-the-art results on all four, outperforming existing baselines by up to 2.6% to 14.2% in absolute F1 score.", "full_text": "Densely Connected Attention Propagation\n\nfor Reading Comprehension\n\n\u2217Yi Tay1, \u2217Luu Anh Tuan2, Siu Cheung Hui3 and Jian Su4\n\n1,3Nanyang Technological University, Singapore\n2,4Institute for Infocomm Research, Singapore\n\nytay017@e.ntu.edu.sg1\n\nat.luu@i2r.a-star.edu.sg2\n\nasschui@ntu.edu.sg3\n\nsujian@i2r.a-star.edu.sg4\n\nAbstract\n\nWe propose DECAPROP (Densely Connected Attention Propagation), a new\ndensely connected neural architecture for reading comprehension (RC). There\nare two distinct characteristics of our model. Firstly, our model densely connects\nall pairwise layers of the network, modeling relationships between passage and\nquery across all hierarchical levels. Secondly, the dense connectors in our network\nare learned via attention instead of standard residual skip-connectors. To this end,\nwe propose novel Bidirectional Attention Connectors (BAC) for ef\ufb01ciently forging\nconnections throughout the network. 
We conduct extensive experiments on four challenging RC benchmarks. Our proposed approach achieves state-of-the-art results on all four, outperforming existing baselines by 2.6% to 14.2% in absolute F1 score.\n\n1 Introduction\n\nThe dominant neural architectures for reading comprehension (RC) typically follow a standard 'encode-interact-point' design [Wang and Jiang, 2016; Seo et al., 2016; Xiong et al., 2016; Wang et al., 2017c; Kundu and Ng, 2018]. Following the embedding layer, a compositional encoder typically encodes Q (query) and P (passage) individually. Subsequently, a (bidirectional) attention layer is then used to model interactions between P/Q. Finally, these attended representations are reasoned over to find (point to) the best answer span. While there might be slight variants of this architecture, this overall architectural design remains consistent across many RC models.\nIntuitively, RC models often possess some depth, i.e., every stage of the network easily comprises several layers. For example, the R-NET [Wang et al., 2017c] architecture adopts three BiRNN layers as the encoder and two additional BiRNN layers at the interaction layer. BiDAF [Seo et al., 2016] uses two BiLSTM layers at the pointer layer, etc. As such, RC models are often relatively deep, at the very least within the context of NLP.\nUnfortunately, the depth of a model is not without implications. It is a well-established fact that increasing the depth may impair gradient flow and feature propagation, making networks harder to train [He et al., 2016; Srivastava et al., 2015; Huang et al., 2017]. This problem is prevalent in computer vision, where mitigation strategies that rely on shortcut connections such as Residual networks [He et al., 2016], GoogLeNet [Szegedy et al., 2015] and DenseNets [Huang et al., 2017] were introduced. 
Naturally, many of the existing RC models already have some built-in designs to work around this issue by shortening the signal path in the network. Examples include attention flow [Seo et al., 2016], residual connections [Xiong et al., 2017; Yu et al., 2018] or simply the usage of highway encoders [Srivastava et al., 2015]. As such, we hypothesize that explicitly improving information flow can lead to further and considerable improvements in RC models.\n\n\u2217Denotes equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nA second observation is that the flow of P/Q representations across the network is often well-aligned and 'synchronous', i.e., P is often only matched with Q at the same hierarchical stage (e.g., only after they have passed through a fixed number of encoder layers). To this end, we hypothesize that increasing the number of interaction interfaces, i.e., matching in an asynchronous, cross-hierarchical fashion, can also lead to an improvement in performance.\nBased on the above-mentioned intuitions, this paper proposes a new architecture with two distinct characteristics. Firstly, our network is densely connected, connecting every layer of P with every layer of Q. This not only facilitates information flow but also increases the number of interaction interfaces between P/Q. Secondly, our network is densely connected by attention, making it vastly different from any residual mitigation strategy in the literature. To the best of our knowledge, this is the first work that explicitly considers attention as a form of skip-connector.\nNotably, models such as BiDAF incorporate a form of attention propagation (flow). However, this is inherently unsuitable for forging dense connections throughout the network since this would incur a massive increase in the representation size in subsequent layers. 
To this end, we propose efficient Bidirectional Attention Connectors (BAC) as a base building block to connect two sequences at arbitrary layers. The key idea is to compress the attention outputs so that they are small enough to propagate, yet still enable a connection between two sequences. The propagated features are collectively passed into prediction layers, which effectively connects shallow layers to deeper layers. Therefore, this enables multiple bidirectional attention calls to be executed without much concern, allowing us to efficiently connect multiple layers together.\nOverall, we propose DECAPROP (Densely Connected Attention Propagation), a novel architecture for reading comprehension. DECAPROP achieves a significant gain of 2.6% to 14.2% absolute improvement in F1 score over the existing state-of-the-art on four challenging RC datasets, namely NewsQA [Trischler et al., 2016], Quasar-T [Dhingra et al., 2017], SearchQA [Dunn et al., 2017] and NarrativeQA [Kočiský et al., 2017].\n\n2 Bidirectional Attention Connectors (BAC)\n\nThis section introduces the Bidirectional Attention Connectors (BAC) module which is central to our overall architecture. The BAC module can be thought of as a connector component that connects two sequences/layers.\nThe key goals of this module are to (1) connect any two layers of P/Q in the network, returning a residual feature that can be propagated2 to deeper layers, (2) model cross-hierarchical interactions between P/Q and (3) minimize any costs incurred to other network components such that this component may be executed multiple times across all layers.\nLet P ∈ R^{ℓp×d} and Q ∈ R^{ℓq×d} be inputs to the BAC module. The initial steps in this module remain identical to standard bi-attention, in which an affinity matrix is constructed between P/Q. 
In our bi-attention module, the affinity matrix is computed via:\n\nEij = (1/√d) F(pi)⊤ F(qj)\n\n(1)\n\nwhere F(.) is a standard dense layer with ReLU activations and d is the dimensionality of the vectors. Note that this is the scaled dot-product attention from Vaswani et al. [2017]. Next, we learn an alignment between P/Q as follows:\n\nA = Softmax(E⊤)P and B = Softmax(E)Q\n\n(2)\n\nwhere A, B are the aligned representations of the query/passage respectively. In many standard neural QA models, it is common to pass an augmented3 matching vector of this attentional representation to subsequent layers. For this purpose, functions such as f = W([bi; pi; bi ⊙ pi; bi − pi]) + b have been used [Wang and Jiang, 2016]. However, simple/naive augmentation would not suffice in our use case. Even without augmentation, every call of bi-attention returns a new d-dimensional vector for each element in the sequence. If the network has l layers, then connecting4 all pairwise layers would require l² connectors and therefore an output dimension of l² × d. This is not only computationally undesirable but also requires a large network at the end to reduce this vector. With augmentation, this problem is aggravated. Hence, standard bidirectional attention is not suitable here.\n\n2Notably, signals still have to back-propagate through the BAC parameters. However, this still enjoys the benefits when connecting faraway layers and also by increasing the number of pathways.\n\n3This refers to common element-wise operations such as subtraction or multiplication.\n\nTo overcome this limitation, we utilize a parameterized function G(.) to compress the bi-attention vectors down to a scalar:\n\ng^p_i = [G([bi; pi]); G(bi − pi); G(bi ⊙ pi)]\n\n(3)\n\nwhere g^p_i ∈ R³ is the output (for each element in P) of the BAC module. 
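To make the mechanics of Equations (1)–(3) concrete, the following NumPy sketch (ours, not the authors' released implementation; all weights and the helper names `fm_scalar` and `bac_forward` are hypothetical) computes the three propagated scalars for each passage token, using a factorization-machine compressor as G(.):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fm_scalar(x, w0, w, V):
    """G(.): a factorization machine (Eq. 4) mapping a vector x to one scalar."""
    # Identity: sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * (||sum_i x_i v_i||^2 - sum_i x_i^2 ||v_i||^2)
    pairwise = 0.5 * (((x @ V) ** 2).sum() - ((x ** 2) @ (V ** 2)).sum())
    return w0 + x @ w + pairwise

def bac_forward(P, Q, Wf, bf, G_cat, G_vec):
    """One BAC call (Eqs. 1-3): returns 3 compressed scalars per token of P."""
    F = lambda X: np.maximum(X @ Wf + bf, 0.0)   # F(.): dense layer + ReLU
    E = F(P) @ F(Q).T / np.sqrt(P.shape[-1])     # Eq. (1): scaled dot-product affinity
    B = softmax(E, axis=-1) @ Q                  # Eq. (2): Q aligned to each token of P
    rows = []
    for b, p in zip(B, P):
        rows.append([fm_scalar(np.concatenate([b, p]), *G_cat),  # G([b; p])
                     fm_scalar(b - p, *G_vec),                   # G(b - p)
                     fm_scalar(b * p, *G_vec)])                  # G(b ⊙ p), Eq. (3)
    return np.asarray(rows)                      # shape (len_p, 3)
```

A symmetric pass using Softmax(E⊤)P in place of B would yield the corresponding three scalars for each query token.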
This is done in an identical fashion for ai and qi to form g^q_i for each element in Q. Intuitively, g^*_i, where * ∈ {p, q}, are the learned scalar attention features that are propagated to upper layers. Since there are only three scalars, they will not cause any problems even when executed multiple times. As such, the connection remains relatively lightweight. This compression layer can be considered a defining trait of the BAC, differentiating it from standard bi-attention.\nNaturally, there are many potential candidates for the function G(.). One natural choice is the standard dense layer (or multiple dense layers). However, dense layers are limited as they do not compute dyadic pairwise interactions between features, which inhibits their expressiveness. On the other hand, factorization-based models are known to not only be expressive and efficient, but also able to model low-rank structure well.\nTo this end, we adopt factorization machines (FM) [Rendle, 2010] as G(.). The FM layer is defined as:\n\nG(x) = w0 + Σ_{i=1..n} wi xi + Σ_{i=1..n} Σ_{j=i+1..n} ⟨vi, vj⟩ xi xj\n\n(4)\n\nwhere v ∈ R^{d×k}, w0 ∈ R and wi ∈ R^d. The output G(x) is a scalar. Intuitively, this layer tries to learn pairwise interactions between every xi and xj using factorized (vector) parameters v. In the context of our BAC module, the FM layer is trying to learn a low-rank structure from the 'match' vector (e.g., bi − pi, bi ⊙ pi or [bi; pi]). Finally, we note that the BAC module takes inspiration from the main body of our CAFE model [Tay et al., 2017] for entailment classification. However, this work demonstrates the usage and potential of the BAC as a residual connector.\n\nFigure 1: High-level overview of our proposed Bidirectional Attention Connectors (BAC). 
BAC supports connecting two sequence layers with attention and produces connectors that can be propagated to deeper layers of the network.\n\n3 Densely Connected Attention Propagation (DECAPROP)\n\nIn this section, we describe our proposed model in detail. Figure 2 depicts a high-level overview of our proposed architecture.\n\n4See encoder component of Figure 2 for more details.\n\nFigure 2: Overview of our proposed model architecture.\n\n3.1 Contextualized Input Encoder\n\nThe inputs to our model are two sequences P and Q which represent the passage and query respectively. Given Q, the task of the RC model is to select a sequence of tokens in P as the answer. Following many RC models, we enhance the input representations with (1) character embeddings (passed into a BiRNN encoder), (2) a binary match feature which denotes if a word in the query appears in the passage (and vice versa) and (3) a normalized frequency score denoting how many times a word appears in the passage. The Char BiRNN of hc dimensions, along with the two additional features, is concatenated with the word embeddings wi ∈ R^{dw} to form the final representation of dw + hc + 2 dimensions.\n\n3.2 Densely Connected Attention Encoder (DECAENC)\n\nThe DECAENC accepts the inputs P and Q from the input encoder. DECAENC is a multi-layered encoder with k layers. For each layer, we pass P/Q into a bidirectional RNN layer of h dimensions. Next, we apply our attention connector (BAC) to H^P/H^Q ∈ R^{ℓ×h}, where H represents the hidden state outputs of the BiRNN encoder and the RNN cell can be either a GRU or an LSTM. 
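As a toy sketch of the input featurization described in Section 3.1 (ours, for illustration only; the Char BiRNN is replaced by a hypothetical mean-pooling stub and the function name is invented), a single token's input vector of dw + hc + 2 dimensions might be assembled as:

```python
import numpy as np

def featurize_token(word_emb, char_states, in_other_seq, freq_norm):
    """Build the d_w + h_c + 2 input vector of Section 3.1.
    word_emb: (d_w,) word embedding; char_states: (num_chars, h_c) outputs of a
    character encoder (stubbed by mean-pooling here, standing in for the Char BiRNN);
    in_other_seq: binary match feature; freq_norm: normalized frequency score."""
    char_summary = char_states.mean(axis=0)  # stand-in for the Char BiRNN state
    return np.concatenate([word_emb, char_summary,
                           [float(in_other_seq), freq_norm]])
```

The two scalar features occupy the last two positions of the resulting vector.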
Let d be the input dimension of P and Q; then this encoder goes through a process of d → h → h+3 → h, in which the BiRNN at layer l + 1 consumes the propagated features from layer l.\nIntuitively, this layer models P/Q whenever they are at the same network hierarchical level. At this point, we include 'asynchronous' (cross-hierarchy) connections between P and Q. Let P^i, Q^i denote the representations of P, Q at layer i. We apply the Bidirectional Attention Connectors (BAC) as follows:\n\nZ^{ij}_p, Z^{ij}_q = F_C(P^i, Q^j) ∀ i, j = 1, 2 ··· n\n\n(5)\n\nwhere F_C represents the BAC component. This densely connects all representations of P and Q across multiple layers. Z^{ij}_* ∈ R^{3×ℓ} represents the generated features for each ij combination of P/Q. In total, we obtain 3n² compressed attention features for each word. Intuitively, these features capture fine-grained relationships between P/Q at different stages of the network flow. The output of the encoder is the concatenation of all the BiRNN hidden states H^1, H^2 ··· H^k and Z^*, which is a matrix of (nh + 3n²) × ℓ dimensions.\n\n3.3 Densely Connected Core Architecture (DECACORE)\n\nThis section introduces the core architecture of our proposed model. This component corresponds to the interaction segment of the standard RC model architecture.\n\nGated Attention The outputs of the densely connected encoder are then passed into a standard gated attention layer. This corresponds to the 'interact' component in many other popular RC models that model Q/P interactions with attention. 
While there are typically many choices for implementing this layer, we adopt the standard gated bi-attention layer following [Wang et al., 2017c]:\n\nS = (1/√d) F(P)⊤ F(Q)\n\n(6)\n\n¯P = Softmax(S)Q\n\n(7)\n\nP′ = BiRNN(σ(Wg([P; ¯P]) + bg) ⊙ P)\n\n(8)\n\nwhere σ is the sigmoid function and F(.) are dense layers with ReLU activations. The output P′ is the query-dependent passage representation.\n\nGated Self-Attention Next, we employ a self-attention layer, applying Equation (8) yet again on P′, matching P′ against itself to form B, the output representation of the core layer. The key idea is that self-attention models each word in the query-dependent passage representation against all other words, enabling each word to benefit from a wider global view of the context.\n\nDense Core At this point, we note that there are two intermediate representations of P, i.e., one after the gated bi-attention layer and one after the gated self-attention layer. We denote them as U^1, U^2 respectively. Unlike the Densely Connected Attention Encoder, we no longer have two representations at each hierarchical level since they have already been 'fused'. Hence, we apply a one-sided BAC to all permutations of [U^1, U^2] and Q^i, ∀ i = 1, 2 ··· k. Note that the one-sided BAC only outputs values for the left sequence, ignoring the right sequence.\n\nR^{kj} = F′_C(U^j, Q^k) ∀ k = 1, 2 ··· n, ∀ j = 1, 2\n\n(9)\n\nwhere R^{kj} ∈ R^{3×ℓ} represents the connection output and F′_C is the one-sided BAC function. All values of R^{kj}, ∀ j = 1, 2, ∀ k = 1, 2 ··· n are concatenated to form a matrix R′ of (2n × 6) × ℓ dimensions, which is then concatenated with U^2 to form M ∈ R^{ℓp×(d+12n)}. 
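The gated bi-attention of Equations (6)–(8) above can be sketched as follows (our illustrative NumPy rendition, not the authors' code; the BiRNN of Equation (8) is stubbed with an identity map and all weights are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention(P, Q, Wf, Wg, bg, birnn=lambda x: x):
    """Eqs. (6)-(8): gated bi-attention producing the query-dependent passage P'."""
    F = lambda X: np.maximum(X @ Wf, 0.0)                     # dense layer + ReLU
    S = F(P) @ F(Q).T / np.sqrt(P.shape[-1])                  # Eq. (6)
    P_bar = softmax(S, axis=-1) @ Q                           # Eq. (7): aligned Q
    gate = sigmoid(np.concatenate([P, P_bar], -1) @ Wg + bg)  # sigma(Wg[P; P_bar] + bg)
    return birnn(gate * P)                                    # Eq. (8), BiRNN stubbed
```

Applying the same function with P′ substituted for both inputs corresponds to the gated self-attention step.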
This final representation is then passed to the answer prediction layer.\n\n3.4 Answer Pointer and Prediction Layer\n\nNext, we pass M through a stacked BiRNN model with two layers and obtain two representations, H†1 and H†2 respectively:\n\nH†1 = BiRNN(M) and H†2 = BiRNN(H†1)\n\n(10)\n\nThe start and end pointers are then learned via:\n\np1 = Softmax(w1 H†1) and p2 = Softmax(w2 H†2)\n\n(11)\n\nwhere w1, w2 ∈ R^d are parameters of this layer. To train the model, following prior work, we minimize the sum of negative log probabilities of the start and end indices:\n\nL(θ) = −(1/N) Σ_{i=1..N} [log(p1_{y1_i}) + log(p2_{y2_i})]\n\n(12)\n\nwhere N is the number of samples and y1_i, y2_i are the true start and end indices. p_k is the k-th value of the vector p. The test span is chosen by finding the maximum value of p1_k · p2_l where k ≤ l.\n\n4 Experiments\n\nThis section describes our experiment setup and empirical results.\n\n4.1 Datasets and Competitor Baselines\n\nWe conduct experiments on four challenging QA datasets which are described as follows:\n\nNewsQA This challenging RC dataset [Trischler et al., 2016] comprises 100k QA pairs. Passages are relatively long at about 600 words on average. This dataset has also been extensively used in benchmarking RC models. On this dataset, the key competitors are BiDAF [Seo et al., 2016], Match-LSTM [Wang and Jiang, 2016], FastQA/FastQA-Ext [Weissenborn et al., 2017], R2-BiLSTM [Weissenborn, 2017] and AMANDA [Kundu and Ng, 2018].\n\nQuasar-T This dataset [Dhingra et al., 2017] comprises 43k factoid-based QA pairs and is constructed using ClueWeb09 as its backbone corpus. The key competitors on this dataset are BiDAF and the Reinforced Ranker-Reader (R3) [Wang et al., 2017a]. 
Several variations of the ranker-reader model (e.g., SR, SR2), which use the Match-LSTM underneath, are also compared against.\n\nSearchQA This dataset [Dunn et al., 2017] aims to emulate the search and retrieval process in question answering applications. The challenge involves reasoning over multiple documents. For this dataset, we concatenate all documents into a single passage context and perform RC over the documents. The competitor baselines on this dataset are Attention Sum Reader (ASR) [Kadlec et al., 2016], Focused Hierarchical RNNs (FH-RNN) [Ke et al., 2018], AMANDA [Kundu and Ng, 2018], BiDAF, AQA [Buck et al., 2017] and the Reinforced Ranker-Reader (R3) [Wang et al., 2017a].\n\nNarrativeQA [Kočiský et al., 2017] is a recent QA dataset that involves comprehension over stories. We use the summaries setting5, which is closer to a standard QA or reading comprehension setting. We compare with the baselines in the original paper, namely Seq2Seq, Attention Sum Reader and BiDAF. We also compare with the recent BiAttention + MRU model [Tay et al., 2018b].\nAs compared to the popular SQuAD dataset [Rajpurkar et al., 2016], these datasets are either (1) more challenging6 and involve more multi-sentence reasoning, or (2) concerned with searching across multiple documents in an 'open domain' setting (SearchQA/Quasar-T). Hence, these datasets reflect real-world applications to a greater extent. However, we regard the concatenated documents as a single context for performing reading comprehension. The evaluation metrics are the EM (exact match) and F1 score. Note that for all datasets, we compare all models solely on the RC task. Therefore, for fair comparison we do not compare with algorithms that use a second-pass answer re-ranker [Wang et al., 2017b]. 
Finally, to ensure that SQuAD is not a failure case for our model, and as requested by reviewers, we also include development set scores of our model on SQuAD.\n\n4.2 Experimental Setup\n\nOur model is implemented in TensorFlow [Abadi et al., 2015]. The sequence lengths are capped at 800/700/1500/1100 for NewsQA, SearchQA, Quasar-T and NarrativeQA respectively. We use Adadelta [Zeiler, 2012] with α = 0.5 for NewsQA, and Adam [Kingma and Ba, 2014] with α = 0.001 for SearchQA, Quasar-T and NarrativeQA. The choice of the RNN encoder is tuned between GRU and LSTM cells and the hidden size is tuned amongst {32, 50, 64, 75}. We use the cuDNN implementation of the RNN encoder. The batch size is tuned amongst {16, 32, 64}. The dropout rate is tuned amongst {0.1, 0.2, 0.3} and applied to all RNN and fully-connected layers. We apply variational dropout [Gal and Ghahramani, 2016] in-between RNN layers. We initialize the word embeddings with 300D GloVe embeddings [Pennington et al., 2014], which are fixed during training. The size of the character embeddings is set to 8 and the character RNN is set to the same size as the word-level RNN encoders. The maximum number of characters per word is set to 16. The number of layers in DECAENC is set to 3 and the number of factors in the factorization kernel is set to 64. 
We use a learning rate decay factor of 2 and a patience of 3 epochs, i.e., the learning rate is halved whenever the EM (or ROUGE-L) score on the development set does not increase for 3 epochs.\n\n5 Results\n\nOverall, our results are optimistic and promising, with results indicating that DECAPROP achieves state-of-the-art performance7 on all four datasets.\n\n5Notably, a new SOTA was set by [Hu et al., 2018] after the NIPS submission deadline.\n6This is claimed by authors in most of the dataset papers.\n7As of NIPS 2018 submission deadline.\n\nModel | Dev EM | Dev F1 | Test EM | Test F1\nMatch-LSTM | 34.4 | 49.6 | 34.9 | 50.0\nBARB | 36.1 | 49.6 | 34.1 | 48.2\nBiDAF | N/A | N/A | 37.1 | 52.3\nNeural BoW | 25.8 | 37.6 | 24.1 | 36.6\nFastQA | 43.7 | 56.4 | 41.9 | 55.7\nFastQAExt | 42.8 | 56.1 | 43.7 | 56.1\nR2-BiLSTM | N/A | N/A | 43.7 | 56.7\nAMANDA | 48.4 | 63.3 | 48.4 | 63.7\nDECAPROP | 52.5 | 65.7 | 53.1 | 66.3\nTable 1: Performance comparison on NewsQA dataset.\n\nModel | Dev EM | Dev F1 | Test EM | Test F1\nGA | 25.6 | 25.6 | 26.4 | 26.4\nBiDAF | 25.9 | 28.9 | 25.7 | 28.5\nSR | N/A | N/A | 31.5 | 38.5\nSR2 | N/A | N/A | 31.9 | 38.8\nR3 | N/A | N/A | 34.2 | 40.9\nDECAPROP | 39.7 | 48.1 | 38.6 | 46.9\nTable 2: Performance comparison on Quasar-T dataset.\n\nModel | Dev Acc | Dev F1n | Test Acc | Test F1n\nTF-IDF max | 13.0 | N/A | 12.7 | N/A\nASR | 43.9 | 24.2 | 41.3 | 22.8\nFH-RNN | 49.6 | 56.7 | 46.8 | 53.4\nAMANDA | 48.6 | 57.7 | 46.8 | 56.6\nDECAPROP | 64.5 | 71.9 | 62.2 | 70.8\nTable 3: Evaluation on original setting, Unigram Accuracy and N-gram F1 scores on SearchQA dataset.\n\nModel | Dev EM | Dev F1 | Test EM | Test F1\nBiDAF | 31.7 | 37.9 | 28.6 | 34.6\nAQA | 38.7 | 47.4 | 40.5 | 45.6\nR3 | N/A | N/A | 49.0 | 55.3\nDECAPROP | 58.8 | 65.5 | 56.8 | 63.6\nTable 4: Evaluation on Exact Match and F1 metrics on SearchQA dataset.\n\nModel | BLEU-1 | BLEU-4 | METEOR | ROUGE-L\nSeq2Seq | 15.89 / 16.10 | 1.26 / 1.40 | 4.08 / 4.22 | 13.15 / 13.29\nAttention Sum Reader | 23.20 / 23.54 | 6.39 / 5.90 | 7.77 / 8.02 | 22.26 / 23.28\nBiDAF | 33.72 / 33.45 | 15.53 / 15.69 | 15.38 / 15.68 | 36.30 / 36.74\nBiAttention + MRU | - / 36.55 | - / 19.79 | - / 17.87 | - / 41.44\nDECAPROP | 42.00 / 44.35 | 23.42 / 27.61 | 23.42 / 21.80 | 40.07 / 44.69\nTable 5: Evaluation on NarrativeQA (Story Summaries). Scores are reported as Test / Validation.\n\nModel | EM | F1\nDCN [Xiong et al., 2016] | 66.2 | 75.9\nDCN + CoVE [McCann et al., 2017] | 71.3 | 79.9\nR-NET [Wang et al., 2017c] | 72.3 | 80.6\nR-NET (Our re-implementation) | 71.9 | 79.6\nDECAPROP (This paper) | 72.9 | 81.4\nQANet [Yu et al., 2018] | 73.6 | 82.7\nTable 6: Single model dev scores (published scores) of some representative models on SQuAD.\n\n8The original SearchQA paper [Dunn et al., 2017], along with AMANDA [Kundu and Ng, 2018], report results on Unigram Accuracy and N-gram F1. On the other hand, [Buck et al., 2017] reports results on overall EM/F1 metrics. We provide comparisons on both.\n\nNewsQA Table 1 reports the results on NewsQA. On this dataset, DECAPROP outperforms the existing state-of-the-art, i.e., the recent AMANDA model, by +4.7% EM / +2.6% F1. Notably, AMANDA is a strong neural baseline that also incorporates gated self-attention layers, along with question-aware pointer layers. Moreover, our proposed model also outperforms well-established baselines such as Match-LSTM (+18% EM / +16.3% F1) and BiDAF (+16% EM / +14% F1).\n\nQuasar-T Table 2 reports the results on Quasar-T. Our model achieves state-of-the-art performance on this dataset, outperforming the state-of-the-art R3 (Reinforced Ranker-Reader) by a considerable margin of +4.4% EM / +6% F1. Performance gains over standard baselines such as BiDAF and GA are even larger (> 15% F1).\n\nSearchQA Table 3 and Table 4 report the results8 on SearchQA. On the original setting, our model outperforms AMANDA by +15.4% EM and +14.2% in terms of F1 score. On the overall setting, our model outperforms both AQA (+18.1% EM / +18% F1) and Reinforced Reader Ranker (+7.8% EM / 
Both models are reinforcement learning based extensions of existing strong baselines\nsuch as BiDAF and Match-LSTM.\n\nNarrativeQA Table 5 reports the results on NarrativeQA. Our proposed model outperforms all\nbaseline systems (Seq2Seq, ASR, BiDAF) in the original paper. On average, there is a \u2248 +5%\nimprovement across all metrics.\n\nSQuAD Table 6 reports dev scores9 of our model against several representative models on the\npopular SQuAD benchmark. While our model does not achieve state-of-the-art performance, our\nmodel can outperform the base R-NET (both our implementation as well as the published score). Our\nmodel achieves reasonably competitive performance.\n\n5.1 Ablation Study\n\nWe conduct an ablation study on the NewsQA development set (Table 7). More speci\ufb01cally, we\nreport the development scores of seven ablation baselines. In (1), we removed the entire DECAPROP\narchitecture, reverting it to an enhanced version of the original R-NET model10. In (2), we removed\nDECACORE and passed U 2 to the answer layer instead of M. In (3), we removed the DECAENC\nlayer and used a 3-layered BiRNN instead. In (4), we kept the DECAENC but only compared layer\nof the same hierarchy and omitted cross hierarchical comparisons. In (5), we removed the Gated\nBi-Attention and Gated Self-Attention layers. Removing these layers simply allow previous layers to\npass through. In (6-7), we varied n, the number of layers of DECAENC. 
Finally, in (8-9), we replaced the FM with linear and nonlinear feed-forward layers.\n\nAblation | EM | F1\n(1) Remove All (R-NET) | 48.1 | 61.2\n(2) w/o DECACORE | 51.5 | 64.5\n(3) w/o DECAENC | 49.3 | 62.0\n(4) w/o Cross Hierarchy | 50.0 | 63.1\n(5) w/o Gated Attention | 49.4 | 62.8\n(6) Set DECAENC n = 2 | 50.5 | 63.4\n(7) Set DECAENC n = 4 | 50.7 | 63.3\n(8) DecaProp (Linear d->1) | 50.9 | 63.0\n(9) DecaProp (Nonlinear d->d->1) | 48.9 | 60.0\nFull Architecture (n = 3) | 52.5 | 65.7\nTable 7: Ablation study on NewsQA development set.\n\nTable 8: Development EM score (DECAPROP versus R-NET) on NewsQA.\n\nFrom (1), we observe a significant gap in performance between DECAPROP and R-NET. This demonstrates the effectiveness of our proposed architecture. Overall, the key insight is that all model components are crucial to DECAPROP. Notably, the DECAENC seems to contribute the most to the overall performance. Finally, Table 8 shows the performance plot of the development EM metric (NewsQA) over training. We observe that the superiority of DECAPROP over R-NET is consistent and relatively stable. This is also observed across other datasets but is not reported due to the lack of space.\n\n6 Related Work\n\nIn recent years, there has been an increase in the number of annotated RC datasets such as SQuAD [Rajpurkar et al., 2016], NewsQA [Trischler et al., 2016], TriviaQA [Joshi et al., 2017] and RACE\n\n9Early testing of our model was actually done on SQuAD. However, since taking part in the heavily contested public leaderboard requires more computational resources than we could muster, we decided to focus on other datasets. In light of reviewer requests, we include preliminary results of our model on the SQuAD dev set.\n\n10For fairer comparison, we make several enhancements to the R-NET model as follows: (1) We replaced the additive attention with scaled dot-product attention similar to ours. (2) We added shortcut connections after the encoder layer. 
(3) We replaced the original Pointer networks with our BiRNN Pointer Layer. We found that these enhancements consistently lead to improved performance. The original R-NET performs ≈ 2% lower on NewsQA.\n\n[Lai et al., 2017]. Spurred on by the availability of data, many neural models have also been proposed to tackle these challenges. These models include BiDAF [Seo et al., 2016], Match-LSTM [Wang and Jiang, 2016], DCN/DCN+ [Xiong et al., 2016, 2017], R-NET [Wang et al., 2017c], DrQA [Chen et al., 2017], AoA Reader [Cui et al., 2016], Reinforced Mnemonic Reader [Hu et al., 2017], ReasoNet [Shen et al., 2017], AMANDA [Kundu and Ng, 2018], R3 Reinforced Reader Ranker [Wang et al., 2017a] and QANet [Yu et al., 2018]. Many of these models innovate at either (1) the bidirectional attention layer (BiDAF, DCN), (2) invoking multi-hop reasoning (Mnemonic Reader, ReasoNet), (3) reinforcement learning (R3, DCN+), (4) self-attention (AMANDA, R-NET, QANet) or, finally, (5) improvements at the encoder level (QANet). While not specifically targeted at reading comprehension, a multitude of pretraining schemes [McCann et al., 2017; Peters et al., 2018; Radford et al.; Devlin et al., 2018] have recently proven to be very effective for language understanding tasks.\nOur work is concerned with densely connected networks aimed at improving information flow [Huang et al., 2017; Srivastava et al., 2015; Szegedy et al., 2015]. While most works are concerned with computer vision tasks or general machine learning, there are several notable works in the NLP domain. Ding et al. [2018] proposed Densely Connected BiLSTMs for standard text classification tasks. [Tay et al., 2018a] proposed a co-stacking residual affinity mechanism that includes all pairwise layers of a text matching model in the affinity matrix calculation. 
In the RC domain, DCN+ [Xiong et al., 2017] used Residual Co-Attention encoders. QANet [Yu et al., 2018] used residual self-attentive convolution encoders. While the usage of highway/residual networks is not an uncommon sight in NLP, the usage of bidirectional attention as a skip-connector is new. Moreover, our work introduces new cross-hierarchical connections, which help to increase the number of interaction interfaces between P/Q.\n\n7 Conclusion\n\nWe proposed a new Densely Connected Attention Propagation (DECAPROP) mechanism. For the first time, we explore the possibilities of using bidirectional attention as a skip-connector. We proposed Bidirectional Attention Connectors (BAC) for efficient connection of any two arbitrary layers, producing connectors that can be propagated to deeper layers. This enables a shortened signal path, aiding information flow across the network. Additionally, the modularity of the BAC allows it to be easily incorporated into other models and even other domains. Our proposed architecture achieves state-of-the-art performance on four challenging QA datasets, outperforming strong and competitive baselines such as the Reinforced Reader Ranker (R3), AMANDA, BiDAF and R-NET.\n\n8 Acknowledgements\n\nThis paper is partially supported by Baidu I2R Research Centre, a joint laboratory between Baidu and A-Star I2R. 
The authors would like to thank the anonymous reviewers of NeurIPS 2018 for their valuable time and feedback!

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. Ask the right questions: Active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830, 2017.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. arXiv preprint arXiv:1607.04423, 2016.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904, 2017.

Zixiang Ding, Rui Xia, Jianfei Yu, Xiang Li, and Jian Yang. Densely connected bidirectional LSTM with applications to sentence classification.
arXiv preprint arXiv:1802.00889, 2018.

Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017.

Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Minghao Hu, Yuxing Peng, and Xipeng Qiu. Mnemonic reader for machine comprehension. arXiv preprint arXiv:1705.02798, 2017.

Minghao Hu, Yuxing Peng, Furu Wei, Zhen Huang, Dongsheng Li, Nan Yang, and Ming Zhou. Attention-guided answer distillation for machine reading comprehension. arXiv preprint arXiv:1808.07644, 2018.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269, 2017. doi: 10.1109/CVPR.2017.243. URL https://doi.org/10.1109/CVPR.2017.243.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547, 2016.

Nan Rosemary Ke, Konrad Zolna, Alessandro Sordoni, Zhouhan Lin, Adam Trischler, Yoshua Bengio, Joelle Pineau, Laurent Charlin, and Chris Pal. Focused hierarchical RNNs for conditional sequence processing. arXiv preprint arXiv:1806.04342, 2018.

Diederik P. Kingma and Jimmy Ba.
Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040, 2017.

Souvik Kundu and Hwee Tou Ng. A question-focused multi-factor attention network for question answering. 2018.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543, 2014.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension.
arXiv preprint arXiv:1611.01603, 2016.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1047–1055. ACM, 2017.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. 2015.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. A compare-propagate architecture with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102, 2017.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. Co-stack residual affinity networks with multi-level attention refinement for matching text sequences. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4492–4502, 2018a.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Multi-range reasoning for machine comprehension. arXiv preprint arXiv:1803.09074, 2018b.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830, 2016.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

Shuohang Wang and Jing Jiang. Machine comprehension using Match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905, 2016.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced reader-ranker for open-domain question answering.
arXiv preprint arXiv:1709.00023, 2017a.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. arXiv preprint arXiv:1711.05116, 2017b.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189–198, 2017c.

Dirk Weissenborn. Reading twice for natural language understanding. arXiv preprint arXiv:1706.02596, 2017.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural QA as simple as possible but not simpler. arXiv preprint arXiv:1703.04816, 2017.

Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. CoRR, abs/1611.01604, 2016.

Caiming Xiong, Victor Zhong, and Richard Socher. DCN+: Mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106, 2017.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, 2018.

Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.