{"title": "XLNet: Generalized Autoregressive Pretraining for Language Understanding", "book": "Advances in Neural Information Processing Systems", "page_first": 5753, "page_last": 5763, "abstract": "With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling.\nHowever, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.\nIn light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation.\nFurthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining.\nEmpirically, under comparable experiment setting, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.", "full_text": "XLNet: Generalized Autoregressive Pretraining\n\nfor Language Understanding\n\nZhilin Yang\u22171, Zihang Dai\u221712, Yiming Yang1, Jaime Carbonell1,\n\n{zhiliny,dzihang,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com\n\nRuslan Salakhutdinov1, Quoc V. Le2\n\n1Carnegie Mellon University, 2Google AI Brain Team\n\nAbstract\n\nWith the capability of modeling bidirectional contexts, denoising autoencoding\nbased pretraining like BERT achieves better performance than pretraining ap-\nproaches based on autoregressive language modeling. However, relying on corrupt-\ning the input with masks, BERT neglects dependency between the masked positions\nand suffers from a pretrain-\ufb01netune discrepancy. 
In light of these pros and cons, we\npropose XLNet, a generalized autoregressive pretraining method that (1) enables\nlearning bidirectional contexts by maximizing the expected likelihood over all\npermutations of the factorization order and (2) overcomes the limitations of BERT\nthanks to its autoregressive formulation. Furthermore, XLNet integrates ideas\nfrom Transformer-XL, the state-of-the-art autoregressive model, into pretraining.\nEmpirically, under comparable experiment setting, XLNet outperforms BERT on\n20 tasks, often by a large margin, including question answering, natural language\ninference, sentiment analysis, and document ranking.1\n\n1 Introduction\n\nUnsupervised representation learning has been highly successful in the domain of natural language\nprocessing [7, 22, 27, 28, 10]. Typically, these methods first pretrain neural networks on large-scale\nunlabeled text corpora, and then finetune the models or representations on downstream tasks. Under\nthis shared high-level idea, different unsupervised pretraining objectives have been explored in the\nliterature. Among them, autoregressive (AR) language modeling and autoencoding (AE) have been\nthe two most successful pretraining objectives.\nAR language modeling seeks to estimate the probability distribution of a text corpus with an\nautoregressive model [7, 27, 28]. Specifically, given a text sequence x = (x1,\u00b7\u00b7\u00b7 , xT ), AR language\nmodeling factorizes the likelihood into a forward product $p(x) = \\prod_{t=1}^{T} p(x_t \\mid x_{<t})$ or a backward\none $p(x) = \\prod_{t=T}^{1} p(x_t \\mid x_{>t})$. A parametric model (e.g. a neural network) is trained to model each\nconditional distribution. Since an AR language model is only trained to encode a uni-directional\ncontext (either forward or backward), it is not effective at modeling deep bidirectional contexts. 
On the\ncontrary, downstream language understanding tasks often require bidirectional context information.\nThis results in a gap between AR language modeling and effective pretraining.\n\n\u2217Equal contribution. Order determined by swapping the one in [9].\n1Pretrained models and code are available at https://github.com/zihangdai/xlnet\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn comparison, AE based pretraining does not perform explicit density estimation but instead aims to\nreconstruct the original data from corrupted input. A notable example is BERT [10], which has been\nthe state-of-the-art pretraining approach. Given the input token sequence, a certain portion of tokens\nare replaced by a special symbol [MASK], and the model is trained to recover the original tokens from\nthe corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize\nbidirectional contexts for reconstruction. As an immediate benefit, this closes the aforementioned\nbidirectional information gap in AR language modeling, leading to improved performance. However,\nthe artificial symbols like [MASK] used by BERT during pretraining are absent from real data at\nfinetuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are\nmasked in the input, BERT is not able to model the joint probability using the product rule as in AR\nlanguage modeling. 
In other words, BERT assumes the predicted tokens are independent of each\nother given the unmasked tokens, which is oversimpli\ufb01ed as high-order, long-range dependency is\nprevalent in natural language [9].\nFaced with the pros and cons of existing language pretraining objectives, in this work, we propose\nXLNet, a generalized autoregressive method that leverages the best of both AR language modeling\nand AE while avoiding their limitations.\n\u2022 Firstly, instead of using a \ufb01xed forward or backward factorization order as in conventional AR mod-\nels, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations\nof the factorization order. Thanks to the permutation operation, the context for each position can\nconsist of tokens from both left and right. In expectation, each position learns to utilize contextual\ninformation from all positions, i.e., capturing bidirectional context.\n\u2022 Secondly, as a generalized AR language model, XLNet does not rely on data corruption. Hence,\nXLNet does not suffer from the pretrain-\ufb01netune discrepancy that BERT is subject to. Meanwhile,\nthe autoregressive objective also provides a natural way to use the product rule for factorizing the\njoint probability of the predicted tokens, eliminating the independence assumption made in BERT.\n\nIn addition to a novel pretraining objective, XLNet improves architectural designs for pretraining.\n\u2022 Inspired by the latest advancements in AR language modeling, XLNet integrates the segment\nrecurrence mechanism and relative encoding scheme of Transformer-XL [9] into pretraining, which\nempirically improves the performance especially for tasks involving a longer text sequence.\n\u2022 Naively applying a Transformer(-XL) architecture to permutation-based language modeling does\nnot work because the factorization order is arbitrary and the target is ambiguous. 
As a solution, we\npropose to reparameterize the Transformer(-XL) network to remove the ambiguity.\n\nEmpirically, under comparable experiment setting, XLNet consistently outperforms BERT [10] on a\nwide spectrum of problems including GLUE language understanding tasks, reading comprehension\ntasks like SQuAD and RACE, text classi\ufb01cation tasks such as Yelp and IMDB, and the ClueWeb09-B\ndocument ranking task.\nRelated Work The idea of permutation-based AR modeling has been explored in [32, 12], but there\nare several key differences. Firstly, previous models aim to improve density estimation by baking\nan \u201corderless\u201d inductive bias into the model while XLNet is motivated by enabling AR language\nmodels to learn bidirectional contexts. Technically, to construct a valid target-aware prediction\ndistribution, XLNet incorporates the target position into the hidden state via two-stream attention\nwhile previous permutation-based AR models relied on implicit position awareness inherent to their\nMLP architectures. Finally, for both orderless NADE and XLNet, we would like to emphasize that\n\u201corderless\u201d does not mean that the input sequence can be randomly permuted but that the model\nallows for different factorization orders of the distribution.\nAnother related idea is to perform autoregressive denoising in the context of text generation [11],\nwhich only considers a \ufb01xed order though.\n\n2 Proposed Method\n\n2.1 Background\n\nIn this section, we \ufb01rst review and compare the conventional AR language modeling and BERT for\nlanguage pretraining. 
Given a text sequence x = [x1,\u00b7\u00b7\u00b7 , xT ], AR language modeling performs\npretraining by maximizing the likelihood under the forward autoregressive factorization:\n\n$$\\max_{\\theta} \\; \\log p_{\\theta}(x) = \\sum_{t=1}^{T} \\log p_{\\theta}(x_t \\mid x_{<t}) = \\sum_{t=1}^{T} \\log \\frac{\\exp\\left(h_{\\theta}(x_{1:t-1})^{\\top} e(x_t)\\right)}{\\sum_{x'} \\exp\\left(h_{\\theta}(x_{1:t-1})^{\\top} e(x')\\right)},$$ (1)\n\nwhere h\u03b8(x1:t\u22121) is a context representation produced by neural models, such as RNNs or Transformers,\nand e(x) denotes the embedding of x. In comparison, BERT is based on denoising auto-encoding.\nSpecifically, for a text sequence x, BERT first constructs a corrupted version \u02c6x by randomly setting\na portion (e.g. 15%) of tokens in x to a special symbol [MASK]. Let the masked tokens be \u00afx. The\ntraining objective is to reconstruct \u00afx from \u02c6x:\n\n$$\\max_{\\theta} \\; \\log p_{\\theta}(\\bar{x} \\mid \\hat{x}) \\approx \\sum_{t=1}^{T} m_t \\log p_{\\theta}(x_t \\mid \\hat{x}) = \\sum_{t=1}^{T} m_t \\log \\frac{\\exp\\left(H_{\\theta}(\\hat{x})_t^{\\top} e(x_t)\\right)}{\\sum_{x'} \\exp\\left(H_{\\theta}(\\hat{x})_t^{\\top} e(x')\\right)},$$ (2)\n\nwhere mt = 1 indicates xt is masked, and H\u03b8 is a Transformer that maps a length-T text sequence x\ninto a sequence of hidden vectors H\u03b8(x) = [H\u03b8(x)1, H\u03b8(x)2,\u00b7\u00b7\u00b7 , H\u03b8(x)T ]. The pros and cons of\nthe two pretraining objectives are compared in the following aspects:\n\u2022 Independence Assumption: As emphasized by the \u2248 sign in Eq. (2), BERT factorizes the joint\nconditional probability p(\u00afx | \u02c6x) based on an independence assumption that all masked tokens \u00afx\nare separately reconstructed. 
In comparison, the AR language modeling objective (1) factorizes\np\u03b8(x) using the product rule that holds universally without such an independence assumption.\n\u2022 Input noise: The input to BERT contains arti\ufb01cial symbols like [MASK] that never occur in\ndownstream tasks, which creates a pretrain-\ufb01netune discrepancy. Replacing [MASK] with original\ntokens as in [10] does not solve the problem because original tokens can be only used with a small\nprobability \u2014 otherwise Eq. (2) will be trivial to optimize. In comparison, AR language modeling\ndoes not rely on any input corruption and does not suffer from this issue.\n\u2022 Context dependency: The AR representation h\u03b8(x1:t\u22121) is only conditioned on the tokens up\nto position t (i.e. tokens to the left), while the BERT representation H\u03b8(x)t has access to the\ncontextual information on both sides. As a result, the BERT objective allows the model to be\npretrained to better capture bidirectional context.\n\n2.2 Objective: Permutation Language Modeling\n\nAccording to the comparison above, AR language modeling and BERT possess their unique advan-\ntages over the other. A natural question to ask is whether there exists a pretraining objective that\nbrings the advantages of both while avoiding their weaknesses.\nBorrowing ideas from orderless NADE [32], we propose the permutation language modeling objective\nthat not only retains the bene\ufb01ts of AR models but also allows models to capture bidirectional\ncontexts. Speci\ufb01cally, for a sequence x of length T , there are T ! different orders to perform a valid\nautoregressive factorization. Intuitively, if model parameters are shared across all factorization orders,\nin expectation, the model will learn to gather information from all positions on both sides.\nTo formalize the idea, let ZT be the set of all possible permutations of the length-T index sequence\n[1, 2, . . . , T ]. 
We use zt and z<t to denote the t-th element and the first t\u22121 elements of a permutation\nz \u2208 ZT . Then, our proposed permutation language modeling objective can be expressed as follows:\n\n$$\\max_{\\theta} \\; \\mathbb{E}_{z \\sim \\mathcal{Z}_T} \\left[ \\sum_{t=1}^{T} \\log p_{\\theta}(x_{z_t} \\mid x_{z_{<t}}) \\right].$$ (3)\n\nEssentially, for a text sequence x, we sample a factorization order z at a time and decompose the\nlikelihood p\u03b8(x) according to the factorization order. Since the same model parameter \u03b8 is shared across\nall factorization orders during training, in expectation, xt has seen every possible element xi \u2260 xt in\nthe sequence, hence being able to capture the bidirectional context. Moreover, as this objective fits\ninto the AR framework, it naturally avoids the independence assumption and the pretrain-finetune\ndiscrepancy discussed in Section 2.1.\nRemark on Permutation The proposed objective only permutes the factorization order, not the\nsequence order. In other words, we keep the original sequence order, use the positional encodings\ncorresponding to the original sequence, and rely on a proper attention mask in Transformers to\nachieve permutation of the factorization order. Note that this choice is necessary, since the model\nwill only encounter text sequences with the natural order during finetuning.\nTo provide an overall picture, we show an example of predicting the token x3 given the same input\nsequence x but under different factorization orders in Appendix A.7 (Figure 4).\n\n2.3 Architecture: Two-Stream Self-Attention for Target-Aware Representations\n\nFigure 1: (a): Content stream attention, which is the same as the standard self-attention. (b): Query\nstream attention, which does not have access to information about the content xzt. 
(c): Overview of the\npermutation language modeling training with two-stream attention.\n\nWhile the permutation language modeling objective has desired properties, naive implementation with\nstandard Transformer parameterization may not work. To see the problem, assume we parameterize\nthe next-token distribution p\u03b8(Xzt | xz<t ) using the standard Softmax formulation, i.e.,\n$p_{\\theta}(X_{z_t} = x \\mid x_{z_{<t}}) = \\frac{\\exp\\left(e(x)^{\\top} h_{\\theta}(x_{z_{<t}})\\right)}{\\sum_{x'} \\exp\\left(e(x')^{\\top} h_{\\theta}(x_{z_{<t}})\\right)}$, where h\u03b8(xz<t) denotes the hidden representation of xz<t\nproduced by the shared Transformer network after proper masking. Now notice that the representation\nh\u03b8(xz<t) does not depend on which position it will predict, i.e., the value of zt. Consequently, the\nsame distribution is predicted regardless of the target position, which is not able to learn useful\nrepresentations (see Appendix A.1 for a concrete example). To avoid this problem, we propose to\nre-parameterize the next-token distribution to be target position aware:\n\n$$p_{\\theta}(X_{z_t} = x \\mid x_{z_{<t}}) = \\frac{\\exp\\left(e(x)^{\\top} g_{\\theta}(x_{z_{<t}}, z_t)\\right)}{\\sum_{x'} \\exp\\left(e(x')^{\\top} g_{\\theta}(x_{z_{<t}}, z_t)\\right)},$$ (4)\n\nwhere g\u03b8(xz<t, zt) denotes a new type of representation that additionally takes the target position\nzt as input.\nTwo-Stream Self-Attention While the idea of target-aware representations removes the ambiguity\nin target prediction, how to formulate g\u03b8(xz<t, zt) remains a non-trivial problem. Among other\npossibilities, we propose to \u201cstand\u201d at the target position zt and rely on the position zt to gather\ninformation from the context xz<t through attention. 
For this parameterization to work, there are two\nrequirements that are contradictory in a standard Transformer architecture: (1) to predict the token\nxzt, g\u03b8(xz<t, zt) should only use the position zt and not the content xzt, otherwise the objective\nbecomes trivial; (2) to predict the other tokens xzj with j > t, g\u03b8(xz<t , zt) should also encode the\ncontent xzt to provide full contextual information. To resolve such a contradiction, we propose to use\ntwo sets of hidden representations instead of one:\n\u2022 The content representation h\u03b8(xz\u2264t), or abbreviated as hzt, which serves a similar role to the\nstandard hidden states in Transformer. This representation encodes both the context and xzt itself.\n\u2022 The query representation g\u03b8(xz<t, zt), or abbreviated as gzt, which only has access to the contextual\ninformation xz<t and the position zt, but not the content xzt, as discussed above.\n\nComputationally, the first layer query stream is initialized with a trainable vector, i.e. $g_i^{(0)} = w$,\nwhile the content stream is set to the corresponding word embedding, i.e. $h_i^{(0)} = e(x_i)$. For each\nself-attention layer m = 1, . . . , M, the two streams of representations are schematically2 updated\nwith a shared set of parameters as follows (illustrated in Figures 1 (a) and (b)):\n\n$$g_{z_t}^{(m)} \\leftarrow \\mathrm{Attention}(Q = g_{z_t}^{(m-1)}, KV = h_{z_{<t}}^{(m-1)}; \\theta), \\quad \\text{(query stream: use } z_t \\text{ but cannot see } x_{z_t}\\text{)}$$\n$$h_{z_t}^{(m)} \\leftarrow \\mathrm{Attention}(Q = h_{z_t}^{(m-1)}, KV = h_{z_{\\le t}}^{(m-1)}; \\theta), \\quad \\text{(content stream: use both } z_t \\text{ and } x_{z_t}\\text{)}$$\n\nwhere Q, K, V denote the query, key, and value in an attention operation [33]. The update rule of the\ncontent representations is exactly the same as the standard self-attention, so during finetuning, we\ncan simply drop the query stream and use the content stream as a normal Transformer(-XL). Finally,\nwe can use the last-layer query representation $g_{z_t}^{(M)}$ to compute Eq. (4).\n\n2To avoid clutter, we omit the implementation details including multi-head attention, residual connection,\nlayer normalization and position-wise feed-forward as used in Transformer(-XL). The details are included in\nAppendix A.2 for reference.\n\n[Figure 1 panel labels: sample a factorization order 3 \u2192 2 \u2192 4 \u2192 1; attention masks; content stream: can see self; query stream: cannot see self; masked two-stream attention.]\n\nPartial Prediction While the permutation language modeling objective (3) has several benefits, it is\na much more challenging optimization problem due to the permutation and causes slow convergence\nin preliminary experiments. To reduce the optimization difficulty, we choose to only predict the last\ntokens in a factorization order. Formally, we split z into a non-target subsequence z\u2264c and a target\nsubsequence z>c, where c is the cutting point. The objective is to maximize the log-likelihood of the\ntarget subsequence conditioned on the non-target subsequence, i.e.,\n\n$$\\max_{\\theta} \\; \\mathbb{E}_{z \\sim \\mathcal{Z}_T} \\left[ \\log p_{\\theta}(x_{z_{>c}} \\mid x_{z_{\\le c}}) \\right] = \\mathbb{E}_{z \\sim \\mathcal{Z}_T} \\left[ \\sum_{t=c+1}^{|z|} \\log p_{\\theta}(x_{z_t} \\mid x_{z_{<t}}) \\right].$$ (5)\n\nNote that z>c is chosen as the target because it possesses the longest context in the sequence given the\ncurrent factorization order z. A hyperparameter K is used such that about 1/K tokens are selected\nfor predictions; i.e., |z| /(|z| \u2212 c) \u2248 K. For unselected tokens, their query representations need not\nbe computed, which saves computation and memory.\n\n2.4 Incorporating Ideas from Transformer-XL\n\nSince our objective function fits in the AR framework, we incorporate the state-of-the-art AR\nlanguage model, Transformer-XL [9], into our pretraining framework, and name our method after it.\nWe integrate two important techniques in Transformer-XL, namely the relative positional encoding\nscheme and the segment recurrence mechanism. We apply relative positional encodings based on the\noriginal sequence as discussed earlier, which is straightforward. Now we discuss how to integrate the\nrecurrence mechanism into the proposed permutation setting and enable the model to reuse hidden\nstates from previous segments. Without loss of generality, suppose we have two segments taken from\na long sequence s; i.e., \u02dcx = s1:T and x = sT +1:2T . Let \u02dcz and z be permutations of [1\u00b7\u00b7\u00b7 T ] and\n[T + 1\u00b7\u00b7\u00b7 2T ] respectively. Then, based on the permutation \u02dcz, we process the first segment, and then\ncache the obtained content representations \u02dch(m) for each layer m. Then, for the next segment x, the\nattention update with memory can be written as\n\n$$h_{z_t}^{(m)} \\leftarrow \\mathrm{Attention}(Q = h_{z_t}^{(m-1)}, KV = \\left[\\tilde{h}^{(m-1)}, h_{z_{\\le t}}^{(m-1)}\\right]; \\theta)$$\n\nwhere [., .] denotes concatenation along the sequence dimension. 
Notice that positional encodings\nonly depend on the actual positions in the original sequence. Thus, the above attention update is\nindependent of \u02dcz once the representations \u02dch(m) are obtained. This allows caching and reusing the\nmemory without knowing the factorization order of the previous segment. In expectation, the model\nlearns to utilize the memory over all factorization orders of the last segment. The query stream can\nbe computed in the same way. Finally, Figure 1 (c) presents an overview of the proposed permutation\nlanguage modeling with two-stream attention (see Appendix A.7 for more detailed illustration).\n\n2.5 Modeling Multiple Segments\n\nMany downstream tasks have multiple input segments, e.g., a question and a context paragraph in\nquestion answering. We now discuss how we pretrain XLNet to model multiple segments in the\nautoregressive framework. During the pretraining phase, following BERT, we randomly sample two\nsegments (either from the same context or not) and treat the concatenation of two segments as one\nsequence to perform permutation language modeling. We only reuse the memory that belongs to\nthe same context. Specifically, the input to our model is the same as BERT: [CLS, A, SEP, B, SEP],\nwhere \u201cSEP\u201d and \u201cCLS\u201d are two special symbols and \u201cA\u201d and \u201cB\u201d are the two segments. Although\nwe follow the two-segment data format, XLNet-Large does not use the objective of next sentence\nprediction [10] as it does not show consistent improvement in our ablation study (see Section 3.4).\nRelative Segment Encodings Architecturally, different from BERT that adds an absolute segment\nembedding to the word embedding at each position, we extend the idea of relative encodings from\nTransformer-XL to also encode the segments. 
Given a pair of positions i and j in the sequence, if\ni and j are from the same segment, we use a segment encoding sij = s+ or otherwise sij = s\u2212,\nwhere s+ and s\u2212 are learnable model parameters for each attention head. In other words, we only\nconsider whether the two positions are within the same segment, as opposed to considering which\nspecific segments they are from. This is consistent with the core idea of relative encodings; i.e., only\nmodeling the relationships between positions. When i attends to j, the segment encoding sij is used\nto compute an attention weight $a_{ij} = (q_i + b)^{\\top} s_{ij}$, where qi is the query vector as in a standard\nattention operation and b is a learnable head-specific bias vector. Finally, the value aij is added to\nthe normal attention weight. There are two benefits of using relative segment encodings. First, the\ninductive bias of relative encodings improves generalization [9]. Second, it opens the possibility of\nfinetuning on tasks that have more than two input segments, which is not possible using absolute\nsegment encodings.\n\n2.6 Discussion\n\nComparing Eq. (2) and (5), we observe that both BERT and XLNet perform partial prediction, i.e.,\nonly predicting a subset of tokens in the sequence. This is a necessary choice for BERT because if all\ntokens are masked, it is impossible to make any meaningful predictions. In addition, for both BERT\nand XLNet, partial prediction plays the role of reducing optimization difficulty by only predicting\ntokens with sufficient context. However, the independence assumption discussed in Section 2.1\nprevents BERT from modeling dependency between targets.\nTo better understand the difference, let\u2019s consider a concrete example [New, York, is, a, city]. Suppose\nboth BERT and XLNet select the two tokens [New, York] as the prediction targets and maximize\nlog p(New York | is a city). Also suppose that XLNet samples the factorization order [is, a, city,\nNew, York]. 
In this case, BERT and XLNet respectively reduce to the following objectives:\n\nJBERT = log p(New | is a city) + log p(York | is a city),\n\nJXLNet = log p(New | is a city) + log p(York | New, is a city).\n\nNotice that XLNet is able to capture the dependency between the pair (New, York), which is omitted\nby BERT. Although in this example, BERT learns some dependency pairs such as (New, city) and\n(York, city), it is obvious that XLNet always learns more dependency pairs given the same target and\ncontains \u201cdenser\u201d effective training signals.\nFor more formal analysis and further discussion, please refer to Appendix A.5.\n\n3 Experiments\n\n3.1 Pretraining and Implementation\n\nFollowing BERT [10], we use the BooksCorpus [40] and English Wikipedia as part of our pretraining\ndata, which have 13GB plain text combined. In addition, we include Giga5 (16GB text) [26],\nClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining. We use heuristics\nto aggressively \ufb01lter out short or low-quality articles for ClueWeb 2012-B and Common Crawl,\nwhich results in 19GB and 110GB text respectively. After tokenization with SentencePiece [17], we\nobtain 2.78B, 1.09B, 4.75B, 4.30B, and 19.97B subword pieces for Wikipedia, BooksCorpus, Giga5,\nClueWeb, and Common Crawl respectively, which are 32.89B in total.\nOur largest model XLNet-Large has the same architecture hyperparameters as BERT-Large, which\nresults in a similar model size. During pretraining, we always use a full sequence length of 512.\nFirstly, to provide a fair comparison with BERT (section 3.2), we also trained XLNet-Large-wikibooks\non BooksCorpus and Wikipedia only, where we reuse all pretraining hyper-parameters as in the\noriginal BERT. Then, we scale up the training of XLNet-Large by using all the datasets described\nabove. 
Specifically, we train on 512 TPU v3 chips for 500K steps with an Adam weight decay\noptimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days. It was\nobserved that the model still underfits the data at the end of training. Finally, we perform an ablation\nstudy (Section 3.4) based on XLNet-Base-wikibooks.\nSince the recurrence mechanism is introduced, we use a bidirectional data input pipeline where each\nof the forward and backward directions takes half of the batch size. For training XLNet-Large, we set\nthe partial prediction constant K as 6 (see Section 2.3). Our finetuning procedure follows BERT [10]\nexcept where otherwise specified3. We employ an idea of span-based prediction, where we first sample a\nlength L \u2208 [1,\u00b7\u00b7\u00b7 , 5], and then randomly select a consecutive span of L tokens as prediction targets\nwithin a context of (KL) tokens.\nWe use a variety of natural language understanding datasets to evaluate the performance of our\nmethod. Detailed descriptions of the settings for all the datasets can be found in Appendix A.3.\n\n3.2 Fair Comparison with BERT\n\nModel | SQuAD1.1 | SQuAD2.0 | RACE | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B\nBERT-Large (Best of 3) | 86.7/92.8 | 82.8/85.5 | 75.1 | 87.3 | 93.0 | 91.4 | 74.0 | 94.0 | 88.7 | 63.7 | 90.2\nXLNet-Large-wikibooks | 88.2/94.0 | 85.1/87.8 | 77.4 | 88.4 | 93.9 | 91.8 | 81.2 | 94.4 | 90.0 | 65.2 | 91.1\n\nTable 1: Fair comparison with BERT. All models are trained using the same data and hyperparameters as in\nBERT. We use the best of 3 BERT variants for comparison; i.e., the original BERT, BERT with whole word\nmasking, and BERT without next sentence prediction.\n\nHere, we first compare the performance of BERT and XLNet in a fair setting to decouple the effects\nof using more data and the improvement from BERT to XLNet. 
In Table 1, we compare (1) the best\nperformance of three different variants of BERT and (2) XLNet trained with the same data and\nhyperparameters. As we can see, trained on the same data with an almost identical training recipe,\nXLNet outperforms BERT by a sizable margin on all the considered datasets.\n\n3.3 Results After Scaling Up\n\nRACE | Accuracy | Middle | High\nGPT [28] | 59.0 | 62.9 | 57.4\nBERT [25] | 72.0 | 76.6 | 70.1\nBERT+DCMN\u2217 [38] | 74.1 | 79.5 | 71.8\nRoBERTa [21] | 83.2 | 86.5 | 81.8\nXLNet | 85.4 | 88.6 | 84.0\n\nModel | NDCG@20 | ERR@20\nDRMM [13] | 24.3 | 13.8\nKNRM [8] | 26.9 | 14.9\nConv [8] | 28.7 | 18.1\nBERT\u2020 | 30.53 | 18.67\nXLNet | 31.10 | 20.28\n\nTable 2: Comparison with state-of-the-art results on the test set of RACE, a reading comprehension task, and on\nClueWeb09-B, a document ranking task. \u2217 indicates using ensembles. \u2020 indicates our implementations. \u201cMiddle\u201d\nand \u201cHigh\u201d in RACE are two subsets representing middle and high school difficulty levels. All BERT, RoBERTa,\nand XLNet results are obtained with a 24-layer architecture with similar model sizes (aka BERT-Large).\n\nAfter the initial publication of our manuscript, a few other pretrained models were released such as\nRoBERTa [21] and ALBERT [19]. Since ALBERT involves increasing the model hidden size from\n1024 to 2048/4096 and thus substantially increases the amount of computation in terms of FLOPs, we\nexclude ALBERT from the following results, as such a comparison would hardly lead to scientific conclusions. To obtain\na relatively fair comparison with RoBERTa, the experiment in this section is based on full data and\nreuses the hyper-parameters of RoBERTa, as described in Section 3.1.\nThe results are presented in Tables 2 (reading comprehension & document ranking), 3 (question\nanswering), 4 (text classification) and 5 (natural language understanding), where XLNet generally\noutperforms BERT and RoBERTa. 
In addition, we make two more interesting observations:\n\u2022 For explicit reasoning tasks like SQuAD and RACE that involve longer context, the performance\ngain of XLNet is usually larger. This superiority at dealing with longer context could come from\nthe Transformer-XL backbone in XLNet.\n\u2022 For classification tasks that already have abundant supervised examples such as MNLI (>390K),\nYelp (>560K) and Amazon (>3M), XLNet still leads to substantial gains.\n\n3Hyperparameters for pretraining and finetuning are in Appendix A.4.\n\nSQuAD2.0 | EM | F1\nDev set results (single model)\nBERT [10] | 78.98 | 81.77\nRoBERTa [21] | 86.5 | 89.4\nXLNet | 87.9 | 90.6\nTest set results on leaderboard (single model, as of Dec 14, 2019)\nBERT\u2217 [10] | 80.005 | 83.061\nRoBERTa [21] | 86.820 | 89.795\nXLNet | 87.926 | 90.689\n\nSQuAD1.1 | EM | F1\nDev set results (single model)\nBERT\u2020 [10] | 84.1 | 90.9\nRoBERTa [21] | 88.9 | 94.6\nXLNet | 89.7 | 95.1\n\nTable 3: Results on SQuAD, a reading comprehension dataset. \u2020 marks our runs with the official code. We are\nnot able to obtain the test results on SQuAD1.1 from the organizers after submitting our result for more than one\nmonth.\n\nModel | IMDB | Yelp-2 | Yelp-5 | DBpedia | AG | Amazon-2 | Amazon-5\nCNN [15] | - | 2.90 | 32.39 | 0.84 | 6.57 | 3.79 | 36.24\nDPCNN [15] | - | 2.64 | 30.58 | 0.88 | 6.87 | 3.32 | 34.81\nMixed VAT [31, 23] | 4.32 | - | - | 0.70 | 4.95 | - | -\nULMFiT [14] | 4.6 | 2.16 | 29.98 | 0.80 | 5.01 | - | -\nBERT [35] | 4.51 | 1.89 | 29.32 | 0.64 | - | 2.63 | 34.17\nXLNet | 3.20 | 1.37 | 27.05 | 0.60 | 4.45 | 2.11 | 31.67\n\nTable 4: Comparison with state-of-the-art error rates on the test sets of several text classification datasets. All\nBERT and XLNet results are obtained with a 24-layer architecture with similar model sizes (aka BERT-Large).\n\nModel | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B | WNLI\nSingle-task single models on dev\nBERT [2] | 86.6/- | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 | -\nRoBERTa [21] | 90.2/90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 | -\nXLNet | 90.8/90.8 | 94.9 | 92.3 | 85.9 | 97.0 | 90.8 | 69.0 | 92.5 | -\nMulti-task ensembles on test (from leaderboard as of Oct 28, 2019)\nMT-DNN\u2217 [20] | 87.9/87.4 | 96.0 | 89.9 | 86.3 | 96.5 | 92.7 | 68.4 | 91.1 | 89.0\nRoBERTa\u2217 [21] | 90.8/90.2 | 98.9 | 90.2 | 88.2 | 96.7 | 92.3 | 67.8 | 92.2 | 89.0\nXLNet\u2217 | 90.9/90.9\u2020 | 99.0\u2020 | 90.4\u2020 | 88.5 | 97.1\u2020 | 92.9 | 70.2 | 93.0 | 92.5\n\nTable 5: Results on GLUE. \u2217 indicates using ensembles, and \u2020 denotes single-task results in a multi-task row.\nAll dev results are the median of 10 runs. The upper section shows direct comparison on dev data and the lower\nsection shows comparison with state-of-the-art results on the public leaderboard.\n\n3.4 Ablation Study\n\nWe perform an ablation study to understand the importance of each design choice based on four\ndatasets with diverse characteristics. Specifically, there are three main aspects we hope to study:\n\u2022 The effectiveness of the permutation language modeling objective alone, especially compared to\nthe denoising auto-encoding objective used by BERT.\n\u2022 The importance of using Transformer-XL as the backbone neural architecture.\n\u2022 The necessity of some implementation details including span-based prediction, the bidirectional\ninput pipeline, and next-sentence prediction.\n\nWith these purposes in mind, in Table 6, we compare 6 XLNet-Base variants with different implementation\ndetails (rows 3 - 8), the original BERT-Base model (row 1), and an additional Transformer-XL\nbaseline trained with the denoising auto-encoding (DAE) objective used in BERT but with the bidirectional\ninput pipeline (row 2). 
For a fair comparison, all models are based on a 12-layer architecture\nwith the same model hyper-parameters as BERT-Base and are trained on only Wikipedia and the\nBooksCorpus. All results reported are the median of 5 runs.\n\n#  Model                   RACE   SQuAD2.0 F1  SQuAD2.0 EM  MNLI m/mm    SST-2\n1  BERT-Base               64.3   76.30        73.66        84.34/84.65  92.78\n2  DAE + Transformer-XL    65.03  79.56        76.80        84.88/84.45  92.60\n3  XLNet-Base (K = 7)      66.05  81.33        78.46        85.84/85.43  92.66\n4  XLNet-Base (K = 6)      66.66  80.98        78.18        85.63/85.12  93.35\n5    - memory              65.55  80.15        77.27        85.32/85.05  92.78\n6    - span-based pred     65.95  80.61        77.91        85.49/85.02  93.12\n7    - bidirectional data  66.34  80.65        77.87        85.31/84.99  92.66\n8    + next-sent pred      66.76  79.83        76.94        85.32/85.09  92.89\n\nTable 6: The results of BERT on RACE are taken from [38]. We run BERT on the other datasets using the\nofficial implementation and the same hyperparameter search space as XLNet. K is a hyperparameter to control\nthe optimization difficulty (see Section 2.3).\n\nExamining rows 1\u20134 of Table 6, we can see both Transformer-XL and the permutation LM clearly\ncontribute to the superior performance of XLNet over BERT. Moreover, if we remove the memory\ncaching mechanism (row 5), the performance clearly drops, especially for RACE which involves the\nlongest context among the 4 tasks. In addition, rows 6\u20137 show that both span-based prediction and\nthe bidirectional input pipeline play important roles in XLNet. Finally, we unexpectedly find that the\nnext-sentence prediction objective proposed in the original BERT does not necessarily lead to an\nimprovement in our setting. 
Hence, we exclude the next-sentence prediction objective from XLNet.\nWe also perform a qualitative study of the attention patterns, which is included in Appendix\nA.6 due to the page limit.\n\n4 Conclusions\n\nXLNet is a generalized AR pretraining method that uses a permutation language modeling objective\nto combine the advantages of AR and AE methods. The neural architecture of XLNet is developed to\nwork seamlessly with the AR objective, including integrating Transformer-XL and the careful design\nof the two-stream attention mechanism. XLNet achieves substantial improvement over previous\npretraining objectives on various tasks.\n\nAcknowledgments\n\nThe authors would like to thank Qizhe Xie and Adams Wei Yu for providing useful feedback on the\nproject, Jamie Callan for providing the ClueWeb dataset, Youlong Cheng, Yanping Huang and Shibo\nWang for providing ideas to improve our TPU implementation, and Chenyan Xiong and Zhuyun Dai\nfor clarifying the setting of the document ranking task. ZY and RS were supported by the Office of\nNaval Research grant N000141812861, the National Science Foundation (NSF) grant IIS1763562,\nthe Nvidia fellowship, and the Siebel scholarship. ZD and YY were supported in part by NSF under\nthe grant IIS-1546329 and by the DOE-Office of Science under the grant ASCR #KJ040201.\n\nReferences\n\n[1] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018.\n\n[2] Anonymous. Bam! born-again multi-task networks for natural language understanding. anonymous preprint under review, 2018.\n\n[3] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.\n\n[4] Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. 
In Advances in Neural Information Processing Systems, pages 400\u2013406, 2000.\n\n[5] Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. Clueweb09 data set, 2009.\n\n[6] Common Crawl. Common crawl. URL: http://commoncrawl.org, 2019.\n\n[7] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079\u20133087, 2015.\n\n[8] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the eleventh ACM international conference on web search and data mining, pages 126\u2013134. ACM, 2018.\n\n[9] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.\n\n[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.\n\n[11] William Fedus, Ian Goodfellow, and Andrew M Dai. Maskgan: better text generation via filling in the_. arXiv preprint arXiv:1801.07736, 2018.\n\n[12] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881\u2013889, 2015.\n\n[13] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55\u201364. ACM, 2016.\n\n[14] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.\n\n[15] Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. 
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562\u2013570, 2017.\n\n[16] Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for winograd schema challenge. arXiv preprint arXiv:1905.06290, 2019.\n\n[17] Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.\n\n[18] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.\n\n[19] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.\n\n[20] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.\n\n[21] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.\n\n[22] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294\u20136305, 2017.\n\n[23] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.\n\n[24] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.\n\n[25] Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. Improving question answering with external knowledge. 
arXiv preprint arXiv:1902.00993, 2019.\n\n[26] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword fifth edition. Technical report, Linguistic Data Consortium, Philadelphia, 2011.\n\n[27] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.\n\n[28] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.\n\n[29] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don\u2019t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.\n\n[30] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.\n\n[31] Devendra Singh Sachan, Manzil Zaheer, and Ruslan Salakhutdinov. Revisiting lstm networks for semi-supervised text classification via mixed objective function. 2018.\n\n[32] Benigno Uria, Marc-Alexandre C\u00f4t\u00e9, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184\u20137220, 2016.\n\n[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998\u20136008, 2017.\n\n[34] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. 
In Proceedings of ICLR, 2019.\n\n[35] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.\n\n[36] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval, pages 55\u201364. ACM, 2017.\n\n[37] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.\n\n[38] Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381, 2019.\n\n[39] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649\u2013657, 2015.\n\n[40] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19\u201327, 2015.", "award": [], "sourceid": 3088, "authors": [{"given_name": "Zhilin", "family_name": "Yang", "institution": "Recurrent AI"}, {"given_name": "Zihang", "family_name": "Dai", "institution": "Carnegie Mellon University"}, {"given_name": "Yiming", "family_name": "Yang", "institution": "CMU"}, {"given_name": "Jaime", "family_name": "Carbonell", "institution": "CMU"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "Carnegie Mellon University"}, {"given_name": "Quoc", "family_name": "Le", "institution": "Google"}]}