{"title": "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 3266, "page_last": 3280, "abstract": "In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at https://super.gluebenchmark.com.", "full_text": "SuperGLUE: A Stickier Benchmark for\n\nGeneral-Purpose Language Understanding Systems\n\nAlex Wang\u2217\n\nNew York University\n\nYada Pruksachatkun\u2217\nNew York University\n\nNikita Nangia\u2217\n\nNew York University\n\nAmanpreet Singh\u2217\nFacebook AI Research\n\nJulian Michael\n\nUniversity of Washington\n\nFelix Hill\nDeepMind\n\nOmer Levy\n\nFacebook AI Research\n\nSamuel R. Bowman\nNew York University\n\nAbstract\n\nIn the last year, new models and methods for pretraining and transfer learning have\ndriven striking performance improvements across a range of language understand-\ning tasks. The GLUE benchmark, introduced a little over one year ago, offers\na single-number metric that summarizes progress on a diverse set of such tasks,\nbut performance on the benchmark has recently surpassed the level of non-expert\nhumans, suggesting limited headroom for further research. 
In this paper we present\nSuperGLUE, a new benchmark styled after GLUE with a new set of more dif\ufb01-\ncult language understanding tasks, a software toolkit, and a public leaderboard.\nSuperGLUE is available at super.gluebenchmark.com.\n\n1\n\nIntroduction\n\nRecently there has been notable progress across many natural language processing (NLP) tasks, led\nby methods such as ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), and BERT\n(Devlin et al., 2019). The unifying theme of these methods is that they couple self-supervised learning\nfrom massive unlabelled text corpora with effective adapting of the resulting model to target tasks.\nThe tasks that have proven amenable to this general approach include question answering, textual\nentailment, and parsing, among many others (Devlin et al., 2019; Kitaev et al., 2019, i.a.).\nIn this context, the GLUE benchmark (Wang et al., 2019a) has become a prominent evaluation\nframework for research towards general-purpose language understanding technologies. GLUE is\na collection of nine language understanding tasks built on existing public datasets, together with\nprivate test data, an evaluation server, a single-number target metric, and an accompanying expert-\nconstructed diagnostic set. GLUE was designed to provide a general-purpose evaluation of language\nunderstanding that covers a range of training data volumes, task genres, and task formulations. 
We\nbelieve it was these aspects that made GLUE particularly appropriate for exhibiting the transfer-\nlearning potential of approaches like OpenAI GPT and BERT.\nThe progress of the last twelve months has eroded headroom on the GLUE benchmark dramatically.\nWhile some tasks (Figure 1) and some linguistic phenomena (Figure 2 in Appendix B) measured\nin GLUE remain dif\ufb01cult, the current state-of-the-art GLUE Score as of early July 2019 (88.4 from\nYang et al., 2019) surpasses human performance (87.1 from Nangia and Bowman, 2019) by 1.3\npoints, and in fact exceeds this human performance estimate on four tasks. Consequently, while there\nremains substantial scope for improvement towards GLUE\u2019s high-level goals, the original version of\nthe benchmark is no longer a suitable metric for quantifying such progress.\n\n\u2217Equal contribution. Correspondence: glue-benchmark-admin@googlegroups.com\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: GLUE benchmark performance for submitted systems, rescaled to set human performance\nto 1.0, shown as a single number score, and broken down into the nine constituent task performances.\nFor tasks with multiple metrics, we use an average of the metrics. More information on the tasks\nincluded in GLUE can be found in Wang et al. (2019a) and in Warstadt et al. (2019, CoLA), Socher\net al. (2013, SST-2), Dolan and Brockett (2005, MRPC), Cer et al. (2017, STS-B), Williams et al.\n(2018, MNLI), and Rajpurkar et al. (2016, the original data source for QNLI).\n\nIn response, we introduce SuperGLUE, a new benchmark designed to pose a more rigorous test of\nlanguage understanding. SuperGLUE has the same high-level motivation as GLUE: to provide a\nsimple, hard-to-game measure of progress toward general-purpose language understanding technolo-\ngies for English. 
We anticipate that signi\ufb01cant progress on SuperGLUE should require substantive\ninnovations in a number of core areas of machine learning, including sample-ef\ufb01cient, transfer,\nmultitask, and unsupervised or self-supervised learning.\nSuperGLUE follows the basic design of GLUE: It consists of a public leaderboard built around\neight language understanding tasks, drawing on existing data, accompanied by a single-number\nperformance metric, and an analysis toolkit. However, it improves upon GLUE in several ways:\nMore challenging tasks: SuperGLUE retains the two hardest tasks in GLUE. The remaining tasks\nwere identi\ufb01ed from those submitted to an open call for task proposals and were selected based on\ndif\ufb01culty for current NLP approaches.\nMore diverse task formats: The task formats in GLUE are limited to sentence- and sentence-pair\nclassi\ufb01cation. We expand the set of task formats in SuperGLUE to include coreference resolution\nand question answering (QA).\nComprehensive human baselines: We include human performance estimates for all benchmark\ntasks, which verify that substantial headroom exists between a strong BERT-based baseline and\nhuman performance.\nImproved code support: SuperGLUE is distributed with a new, modular toolkit for work on\npretraining, multi-task learning, and transfer learning in NLP, built around standard tools including\nPyTorch (Paszke et al., 2017) and AllenNLP (Gardner et al., 2017).\nRe\ufb01ned usage rules: The conditions for inclusion on the SuperGLUE leaderboard have been\nrevamped to ensure fair competition, an informative leaderboard, and full credit assignment to data\nand task creators.\nThe SuperGLUE leaderboard, data, and software tools are available at super.gluebenchmark.com.\n\n2 Related Work\n\nMuch work prior to GLUE demonstrated that training neural models with large amounts of available\nsupervision can produce representations that effectively transfer to a broad range of NLP 
tasks\n(Collobert and Weston, 2008; Dai and Le, 2015; Kiros et al., 2015; Hill et al., 2016; Conneau and\nKiela, 2018; McCann et al., 2017; Peters et al., 2018).\n\nTable 1: The tasks included in SuperGLUE. WSD stands for word sense disambiguation, NLI is\nnatural language inference, coref. is coreference resolution, and QA is question answering. For\nMultiRC, we list the number of total answers for 456/83/166 train/dev/test questions.\n\nCorpus   |Train|  |Dev|  |Test|  Task    Metrics  Text Sources\nBoolQ    9427     3270   3245    QA      acc.     Google queries, Wikipedia\nCB       250      57     250     NLI     acc./F1  various\nCOPA     400      100    500     QA      acc.     blogs, photography encyclopedia\nMultiRC  5100     953    1800    QA      F1a/EM   various\nReCoRD   101k     10k    10k     QA      F1/EM    news (CNN, Daily Mail)\nRTE      2500     278    300     NLI     acc.     news, Wikipedia\nWiC      6000     638    1400    WSD     acc.     WordNet, VerbNet, Wiktionary\nWSC      554      104    146     coref.  acc.     \ufb01ction books\n\nGLUE was presented as a formal challenge\naffording straightforward comparison between such task-agnostic transfer learning techniques. Other\nsimilarly-motivated benchmarks include SentEval (Conneau and Kiela, 2018), which speci\ufb01cally\nevaluates \ufb01xed-size sentence embeddings, and DecaNLP (McCann et al., 2018), which recasts a set\nof target tasks into a general question-answering format and prohibits task-speci\ufb01c parameters. 
In\ncontrast, GLUE provides a lightweight classi\ufb01cation API and no restrictions on model architecture or\nparameter sharing, which seems to have been well-suited to recent work in this area.\nSince its release, GLUE has been used as a testbed and showcase by the developers of several\nin\ufb02uential models, including GPT (Radford et al., 2018) and BERT (Devlin et al., 2019). As shown\nin Figure 1, progress on GLUE since its release has been striking. On GLUE, GPT and BERT\nachieved scores of 72.8 and 80.2 respectively, relative to 66.5 for an ELMo-based model (Peters\net al., 2018) and 63.7 for the strongest baseline with no multitask learning or pretraining above the\nword level. Recent models (Liu et al., 2019d; Yang et al., 2019) have clearly surpassed estimates of\nnon-expert human performance on GLUE (Nangia and Bowman, 2019). The success of these models\non GLUE has been driven by ever-increasing model capacity, compute power, and data quantity, as\nwell as innovations in model expressivity (from recurrent to bidirectional recurrent to multi-headed\ntransformer encoders) and degree of contextualization (from learning representation of words in\nisolation to using uni-directional contexts and ultimately to leveraging bidirectional contexts).\nIn parallel to work scaling up pretrained models, several studies have focused on complementary\nmethods for augmenting performance of pretrained models. Phang et al. (2018) show that BERT can\nbe improved using two-stage pretraining, i.e., \ufb01ne-tuning the pretrained model on an intermediate\ndata-rich supervised task before \ufb01ne-tuning it again on a data-poor target task. Liu et al. (2019d,c) and\nBach et al. (2018) get further improvements respectively via multi-task \ufb01netuning and using massive\namounts of weak supervision. Clark et al. 
(2019b) demonstrate that knowledge distillation (Hinton\net al., 2015; Furlanello et al., 2018) can lead to student networks that outperform their teachers.\nOverall, the quantity and quality of research contributions aimed at the challenges posed by GLUE\nunderline the utility of this style of benchmark for machine learning researchers looking to evaluate\nnew application-agnostic methods on language understanding.\nLimits to current approaches are also apparent via the GLUE suite. Performance on the GLUE\ndiagnostic entailment dataset, at 0.42 R3, falls far below the average human performance of 0.80\nR3 reported in the original GLUE publication, with models performing near, or even below, chance\non some linguistic phenomena (Figure 2, Appendix B). While some initially dif\ufb01cult categories\nsaw gains from advances on GLUE (e.g., double negation), others remain hard (restrictivity) or\neven adversarial (disjunction, downward monotonicity). This suggests that even as unsupervised\npretraining produces ever-better statistical summaries of text, it remains dif\ufb01cult to extract many\ndetails crucial to semantics without the right kind of supervision. Much recent work has made similar\nobservations about the limitations of existing pretrained models (Jia and Liang, 2017; Naik et al.,\n2018; McCoy and Linzen, 2019; McCoy et al., 2019; Liu et al., 2019a,b).\n\nTable 2: Development set examples from the tasks in SuperGLUE. Bold text represents part of the\nexample format for each task. Text in italics is part of the model input. Underlined text is specially\nmarked in the input. Text in a monospaced font represents the expected model output.\n\nBoolQ\nPassage: Barq\u2019s \u2013 Barq\u2019s is an American soft drink. Its brand of root beer is notable for having caffeine.\nBarq\u2019s, created by Edward Barq and bottled since the turn of the 20th century, is owned by the Barq\nfamily but bottled by the Coca-Cola Company. 
It was known as Barq\u2019s Famous Olde Tyme Root Beer\nuntil 2012.\nQuestion: is barq\u2019s root beer a pepsi product Answer: No\n\nCB\nText: B: And yet, uh, I we-, I hope to see employer based, you know, helping out. You know, child, uh,\ncare centers at the place of employment and things like that, that will help out. A: Uh-huh. B: What do\nyou think, do you think we are, setting a trend?\nHypothesis: they are setting a trend Entailment: Unknown\n\nCOPA\nPremise: My body cast a shadow over the grass. Question: What\u2019s the CAUSE for this?\nAlternative 1: The sun was rising. Alternative 2: The grass was cut.\nCorrect Alternative: 1\n\nMultiRC\nParagraph: Susan wanted to have a birthday party. She called all of her friends. She has \ufb01ve friends.\nHer mom said that Susan can invite them all to the party. Her \ufb01rst friend could not go to the party\nbecause she was sick. Her second friend was going out of town. Her third friend was not so sure if her\nparents would let her. The fourth friend said maybe. The \ufb01fth friend could go to the party for sure. Susan\nwas a little sad. On the day of the party, all \ufb01ve friends showed up. Each friend had a present for Susan.\nSusan was happy and sent each friend a thank you card the next week\nQuestion: Did Susan\u2019s sick friend recover? Candidate answers: Yes, she recovered (T), No (F), Yes\n(T), No, she didn\u2019t recover (F), Yes, she was at Susan\u2019s party (T)\n\nReCoRD\nParagraph: (CNN) Puerto Rico on Sunday overwhelmingly voted for statehood. But Congress, the only\nbody that can approve new states, will ultimately decide whether the status of the US commonwealth\nchanges. Ninety-seven percent of the votes in the nonbinding referendum favored statehood, an increase\nover the results of a 2012 referendum, of\ufb01cial results from the State Electorcal Commission show. It\nwas the \ufb01fth such vote on statehood. 
\"Today, we the people of Puerto Rico are sending a strong and\nclear message to the US Congress ... and to the world ... claiming our equal rights as American citizens,\nPuerto Rico Gov. Ricardo Rossello said in a news release. @highlight Puerto Rico voted Sunday in\nfavor of US statehood\nQuery For one, they can truthfully say, \u201cDon\u2019t blame me, I didn\u2019t vote for them, \u201d when discussing the\n<placeholder> presidency Correct Entities: US\n\nRTE\nText: Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44,\naccording to the Christopher Reeve Foundation.\nHypothesis: Christopher Reeve had an accident. Entailment: False\n\nWiC\nContext 1: Room and board. Context 2: He nailed boards across the windows.\nSense match: False\n\nWSC\nText: Mark told Pete many lies about himself, which Pete included in his book. He should have been\nmore truthful. Coreference: False\n\n3 SuperGLUE Overview\n\n3.1 Design Process\n\nThe goal of SuperGLUE is to provide a simple, robust evaluation metric of any method capable of\nbeing applied to a broad range of language understanding tasks. To that end, in designing SuperGLUE,\nwe identify the following desiderata of tasks in the benchmark:\nTask substance: Tasks should test a system\u2019s ability to understand and reason about texts in English.\nTask dif\ufb01culty: Tasks should be beyond the scope of current state-of-the-art systems, but solvable by\nmost college-educated English speakers. We exclude tasks that require domain-speci\ufb01c knowledge,\ne.g. medical notes or scienti\ufb01c papers.\nEvaluability: Tasks must have an automatic performance metric that corresponds well to human\njudgments of output quality. 
Some text generation tasks fail to meet this criterion due to issues with\nautomatic metrics like ROUGE and BLEU (Callison-Burch et al., 2006; Liu et al., 2016, i.a.).\nPublic data: We require that tasks have existing public training data in order to minimize the risks\ninvolved in newly-created datasets. We also prefer tasks for which we have access to (or could create)\na test set with private labels.\nTask format: We prefer tasks that have relatively simple input and output formats, to avoid incentiviz-\ning the users of the benchmark to create complex task-speci\ufb01c model architectures. Still, while GLUE\nis restricted to tasks involving single sentence or sentence pair inputs, for SuperGLUE we expand\nthe scope to consider tasks with longer inputs. This yields a set of tasks that requires understanding\nindividual tokens in context, complete sentences, inter-sentence relations, and entire paragraphs.\nLicense: Task data must be available under licenses that allow use and redistribution for research\npurposes.\nTo identify possible tasks for SuperGLUE, we disseminated a public call for task proposals to the\nNLP community, and received approximately 30 proposals. We \ufb01ltered these proposals according\nto our criteria. Many proposals were not suitable due to licensing issues, complex formats, and\ninsuf\ufb01cient headroom; we provide examples of such tasks in Appendix D. For each of the remaining\ntasks, we ran a BERT-based baseline and a human baseline, and \ufb01ltered out tasks which were either\ntoo challenging for humans without extensive training or too easy for our machine baselines.\n\n3.2 Selected Tasks\n\nFollowing this process, we arrived at eight tasks to use in SuperGLUE. See Tables 1 and 2 for details\nand speci\ufb01c examples of each task.\nBoolQ (Boolean Questions, Clark et al., 2019a) is a QA task where each example consists of a short\npassage and a yes/no question about the passage. 
The questions are provided anonymously and\nunsolicited by users of the Google search engine, and afterwards paired with a paragraph from a\nWikipedia article containing the answer. Following the original work, we evaluate with accuracy.\nCB (CommitmentBank, de Marneffe et al., 2019) is a corpus of short texts in which at least one\nsentence contains an embedded clause. Each of these embedded clauses is annotated with the degree\nto which it appears the person who wrote the text is committed to the truth of the clause. The resulting\ntask is framed as three-class textual entailment on examples that are drawn from the Wall Street Journal,\n\ufb01ction from the British National Corpus, and Switchboard. Each example consists of a premise\ncontaining an embedded clause and the corresponding hypothesis is the extraction of that clause.\nWe use a subset of the data that had inter-annotator agreement above 80%. The data is imbalanced\n(relatively fewer neutral examples), so we evaluate using accuracy and F1, where for multi-class F1\nwe compute the unweighted average of the F1 per class.\nCOPA (Choice of Plausible Alternatives, Roemmele et al., 2011) is a causal reasoning task in which\na system is given a premise sentence and must determine either the cause or effect of the premise\nfrom two possible choices. All examples are handcrafted and focus on topics from blogs and a\nphotography-related encyclopedia. Following the original work, we evaluate using accuracy.\nMultiRC (Multi-Sentence Reading Comprehension, Khashabi et al., 2018) is a QA task where each\nexample consists of a context paragraph, a question about that paragraph, and a list of possible\nanswers. The system must predict which answers are true and which are false. 
While many QA\ntasks exist, we use MultiRC because of a number of desirable properties: (i) each question can have\nmultiple possible correct answers, so each question-answer pair must be evaluated independently of\nother pairs, (ii) the questions are designed such that answering each question requires drawing facts\nfrom multiple context sentences, and (iii) the question-answer pair format more closely matches\nthe API of other tasks in SuperGLUE than the more popular span-extractive QA format does. The\nparagraphs are drawn from seven domains including news, \ufb01ction, and historical text. The evaluation\nmetrics are F1 over all answer-options (F1a) and exact match of each question\u2019s set of answers (EM).\nReCoRD (Reading Comprehension with Commonsense Reasoning Dataset, Zhang et al., 2018) is a\nmultiple-choice QA task. Each example consists of a news article and a Cloze-style question about\nthe article in which one entity is masked out. The system must predict the masked out entity from a\nlist of possible entities in the provided passage, where the same entity may be expressed with multiple\ndifferent surface forms, which are all considered correct. Articles are from CNN and Daily Mail. We\nevaluate with max (over all mentions) token-level F1 and exact match (EM).\nRTE (Recognizing Textual Entailment) datasets come from a series of annual competitions on textual\nentailment. RTE is included in GLUE, and we use the same data and format as GLUE: We merge data\nfrom RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and\nRTE5 (Bentivogli et al., 2009). All datasets are combined and converted to two-class classi\ufb01cation:\nentailment and not_entailment. 
Of all the GLUE tasks, RTE is among those that bene\ufb01t from\ntransfer learning the most, with performance jumping from near random-chance (\u223c56%) at the time\nof GLUE\u2019s launch to 86.3% accuracy (Liu et al., 2019d; Yang et al., 2019) at the time of writing.\nGiven the nearly eight point gap with respect to human performance, however, the task is not yet\nsolved by machines, and we expect the remaining gap to be dif\ufb01cult to close.\nWiC (Word-in-Context, Pilehvar and Camacho-Collados, 2019) is a word sense disambiguation task\ncast as binary classi\ufb01cation of sentence pairs. Given two text snippets and a polysemous word that\nappears in both sentences, the task is to determine whether the word is used with the same sense in\nboth sentences. Sentences are drawn from WordNet (Miller, 1995), VerbNet (Schuler, 2005), and\nWiktionary. We follow the original work and evaluate using accuracy.\nWSC (Winograd Schema Challenge, Levesque et al., 2012) is a coreference resolution task in\nwhich examples consist of a sentence with a pronoun and a list of noun phrases from the sentence.\nThe system must determine the correct referent of the pronoun from among the provided choices.\nWinograd schemas are designed to require everyday knowledge and commonsense reasoning to solve.\nGLUE includes a version of WSC recast as NLI, known as WNLI. Until very recently, no substantial\nprogress had been made on WNLI, with many submissions opting to submit majority class predic-\ntions.2 In the past few months, several works (Kocijan et al., 2019; Liu et al., 2019d) have made rapid\nprogress via a heuristic data augmentation scheme, raising machine performance to 90.4% accuracy.\nGiven estimated human performance of \u223c96%, there is still a gap between machine and human\nperformance, which we expect will be relatively dif\ufb01cult to close. 
We therefore include a version of\nWSC cast as binary classi\ufb01cation, where each example consists of a sentence with a marked pronoun\nand noun, and the task is to determine if the pronoun refers to that noun. The training and validation\nexamples are drawn from the original WSC data (Levesque et al., 2012), as well as those distributed\nby the af\ufb01liated organization Commonsense Reasoning.3 The test examples are derived from \ufb01ction\nbooks and have been shared with us by the authors of the original dataset. We evaluate using accuracy.\n\n3.3 Scoring\n\nAs with GLUE, we seek to give a sense of aggregate system performance over all tasks by averaging\nscores of all tasks. Lacking a fair criterion with which to weight the contributions of each task to\nthe overall score, we opt for the simple approach of weighing each task equally, and for tasks with\nmultiple metrics, \ufb01rst averaging those metrics to get a task score.\n\n3.4 Tools for Model Analysis\n\nAnalyzing Linguistic and World Knowledge in Models GLUE includes an expert-constructed,\ndiagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and\nworld knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with\na three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that\nindicate the phenomena that characterize the relationship between the two sentences. Submissions to\nthe GLUE leaderboard are required to include predictions from the submission\u2019s MultiNLI classi\ufb01er\non the diagnostic dataset, and analyses of the results were shown alongside the main leaderboard.\nSince this diagnostic task has proved dif\ufb01cult for top models, we retain it in SuperGLUE. 
However,\nsince MultiNLI is not part of SuperGLUE, we collapse contradiction and neutral into a single\nnot_entailment label, and request that submissions include predictions on the resulting set from the\nmodel used for the RTE task. We estimate human performance following the same procedure we use\nfor the benchmark tasks (Appendix C). We estimate an accuracy of 88% and a Matthews correlation\ncoef\ufb01cient (MCC, the two-class variant of the R3 metric used in GLUE) of 0.77.\n\n2 WNLI is especially dif\ufb01cult due to an adversarial train/dev split: Premise sentences that appear in the\ntraining set often appear in the development set with a different hypothesis and a \ufb02ipped label. If a system\nmemorizes the training set, which was easy due to the small size of the training set, it could perform far below\nchance on the development set. We remove this adversarial design in our version of WSC by ensuring that no\nsentences are shared between the training, validation, and test sets.\n\n3 http://commonsensereasoning.org/disambiguation.html\n\nAnalyzing Gender Bias in Models Recent work has identi\ufb01ed the presence and ampli\ufb01cation of\nmany social biases in data-driven machine learning models (Lu et al., 2018; Zhao et al., 2018, i.a.). To\npromote the detection of such biases, we include Winogender (Rudinger et al., 2018) as an additional\ndiagnostic dataset. Winogender is designed to measure gender bias in coreference resolution systems.\nWe use the Diverse Natural Language Inference Collection (Poliak et al., 2018) version that casts\nWinogender as a textual entailment task. Each example consists of a premise sentence with a male or\nfemale pronoun and a hypothesis giving a possible antecedent of the pronoun. Examples occur in\nminimal pairs, where the only difference between an example and its pair is the gender of the pronoun\nin the premise. 
Performance on Winogender is measured with accuracy and the gender parity score:\nthe percentage of minimal pairs for which the predictions are the same. A system can trivially obtain\na perfect gender parity score by guessing the same class for all examples, so a high gender parity\nscore is meaningless unless accompanied by high accuracy. We collect non-expert annotations to\nestimate human performance, and observe an accuracy of 99.7% and a gender parity score of 0.99.\nLike any diagnostic, Winogender has limitations. It offers only positive predictive value: A poor\nbias score is clear evidence that a model exhibits gender bias, but a good score does not mean that\nthe model is unbiased. More speci\ufb01cally, in the DNC version of the task, a low gender parity score\nmeans that a model\u2019s prediction of textual entailment can be changed with a change in pronouns, all\nelse equal. It is plausible that there are forms of bias that are relevant to target tasks of interest, but\nthat do not surface in this setting (Gonen and Goldberg, 2019). Also, Winogender does not cover all\nforms of social bias, or even all forms of gender. For instance, the version of the data used here offers\nno coverage of gender-neutral they or non-binary pronouns. Despite these limitations, we believe that\nWinogender\u2019s inclusion is worthwhile in providing a coarse sense of how social biases evolve with\nmodel performance and for keeping attention on the social rami\ufb01cations of NLP models.\n\n4 Using SuperGLUE\n\nSoftware Tools To facilitate using SuperGLUE, we release jiant (Wang et al., 2019b),4 a modular\nsoftware toolkit, built with PyTorch (Paszke et al., 2017), components from AllenNLP (Gardner\net al., 2017), and the transformers package.5 jiant implements our baselines and supports the\nevaluation of custom models and training methods on the benchmark tasks. 
The toolkit includes\nsupport for existing popular pretrained models such as OpenAI GPT and BERT, as well as support\nfor multistage and multitask learning of the kind seen in the strongest models on GLUE.\n\nEligibility Any system or method that can produce predictions for the SuperGLUE tasks is eligible\nfor submission to the leaderboard, subject to the data-use and submission frequency policies stated\nimmediately below. There are no restrictions on the type of methods that may be used, and there is\nno requirement that any form of parameter sharing or shared initialization be used across the tasks in\nthe benchmark. To limit over\ufb01tting to the private test data, users are limited to a maximum of two\nsubmissions per day and six submissions per month.\n\nData Data for the tasks are available for download through the SuperGLUE site and through a\ndownload script included with the software toolkit. Each task comes with a standardized training set,\ndevelopment set, and unlabeled test set. Submitted systems may use any public or private data when\ndeveloping their systems, with a few exceptions: Systems may only use the SuperGLUE-distributed\nversions of the task datasets, as these use different train/validation/test splits from other public\nversions in some cases. Systems also may not use the unlabeled test data for the tasks in system\ndevelopment in any way, may not use the structured source data that was used to collect the WiC\nlabels (sense-annotated example sentences from WordNet, VerbNet, and Wiktionary) in any way, and\nmay not build systems that share information across separate test examples in any way.\nTo ensure reasonable credit assignment, because we build very directly on prior work, we ask the\nauthors of submitted systems to directly name and cite the speci\ufb01c datasets that they use, including the\nbenchmark datasets. 
We will enforce this as a requirement for papers to be listed on the leaderboard.\n\n4 https://github.com/nyu-mll/jiant\n5 https://github.com/huggingface/transformers\n\nTable 3: Baseline performance on the SuperGLUE test sets and diagnostics. For CB we report\naccuracy and macro-average F1. For MultiRC we report F1 on all answer-options and exact match\nof each question\u2019s set of correct answers. AXb is the broad-coverage diagnostic task, scored using\nMatthews correlation (MCC). AXg is the Winogender diagnostic, scored using accuracy and the\ngender parity score (GPS). All values are scaled by 100. The Avg column is the overall benchmark\nscore on non-AX\u2217 tasks. The bolded numbers re\ufb02ect the best machine performance on task. *MultiRC\nhas multiple test sets released on a staggered schedule, and these results evaluate on an installation of\nthe test set that is a subset of ours.\n\nModel          Avg   BoolQ  CB         COPA   MultiRC      ReCoRD     RTE   WiC   WSC    AXb   AXg\nMetrics              Acc.   F1/Acc.    Acc.   F1a/EM       F1/EM      Acc.  Acc.  Acc.   MCC   GPS/Acc.\nMost Frequent  47.1  62.3   21.7/48.4  50.0   61.1/0.3     33.4/32.5  50.3  50.0  65.1   0.0   100.0/50.0\nCBoW           44.3  62.1   49.0/71.2  51.6   0.0/0.4      14.0/13.6  49.7  53.0  65.1   -0.4  100.0/50.0\nBERT           69.0  77.4   75.7/83.6  70.6   70.0/24.0    72.0/71.3  71.6  69.5  64.3   23.0  97.8/51.7\nBERT++         71.5  79.0   84.7/90.4  73.8   70.0/24.1    72.0/71.3  79.0  69.5  64.3   38.0  99.4/51.4\nOutside Best   -     80.4   -/-        84.4   70.4*/24.5*  74.8/73.0  82.7  -     -      -     -/-\nHuman (est.)   89.8  89.0   95.8/98.9  100.0  81.8*/51.9*  91.7/91.3  93.6  80.0  100.0  77.0  99.3/99.7\n\n5 Experiments\n\n5.1 Baselines\n\nBERT Our main baselines are built around BERT, variants of which are among the most successful\napproaches on GLUE at the time of writing. 
Speci\ufb01cally, we use the bert-large-cased variant.\nFollowing the practice recommended in Devlin et al. 
(2019), for each task, we use the simplest\npossible architecture on top of BERT. We fine-tune a copy of the pretrained BERT model separately\nfor each task, and leave the development of multi-task learning models to future work. For training,\nwe use the procedure specified in Devlin et al. (2019): We use Adam (Kingma and Ba, 2014) with an\ninitial learning rate of 10^-5 and fine-tune for a maximum of 10 epochs.\nFor classification tasks with sentence-pair inputs (BoolQ, CB, RTE, WiC), we concatenate the\nsentences with a [SEP] token, feed the fused input to BERT, and use a logistic regression classifier\nthat sees the representation corresponding to [CLS]. For WiC, we also concatenate the representation\nof the marked word. For COPA, MultiRC, and ReCoRD, for each answer choice, we similarly\nconcatenate the context with that answer choice and feed the resulting sequence into BERT to produce\nan answer representation. For COPA, we project these representations into a scalar, and take as the\nanswer the choice with the highest associated scalar. For MultiRC, because each question can have\nmore than one correct answer, we feed each answer representation into a logistic regression classifier.\nFor ReCoRD, we also evaluate the probability of each candidate independently of the other candidates,\nand take the most likely candidate as the model\u2019s prediction. For WSC, which is a span-based task,\nwe use a model inspired by Tenney et al. (2019). Given the BERT representation for each word in the\noriginal sentence, we get span representations of the pronoun and noun phrase via a self-attention\nspan-pooling operator (Lee et al., 2017), before feeding them into a logistic regression classifier.\n\nBERT++ We also report results using BERT with additional training on related datasets before\nfine-tuning on the benchmark tasks, following the STILTs style of transfer learning (Phang et al.,\n2018). 
Given the productive use of MultiNLI in pretraining and intermediate fine-tuning of pretrained\nlanguage models (Conneau et al., 2017; Phang et al., 2018, i.a.), for CB, RTE, and BoolQ, we use\nMultiNLI as a transfer task by first using the above procedure on MultiNLI. Similarly, given the\nsimilarity of COPA to SWAG (Zellers et al., 2018), we first fine-tune BERT on SWAG. These results\nare reported as BERT++. For all other tasks, we reuse the results of BERT fine-tuned on just that task.\n\nOther Baselines We include a baseline where for each task we simply predict the majority class,6\nas well as a bag-of-words baseline where each input is represented as an average of its tokens\u2019 GloVe\nword vectors (the 300D/840B release from Pennington et al., 2014). Finally, we list the best known\nresult on each task as of May 2019, except on tasks which we recast (WSC), resplit (CB), or on which\nwe achieve the best known result (WiC). The outside results for COPA, MultiRC, and RTE are from\nSap et al. (2019), Trivedi et al. (2019), and Liu et al. (2019d) respectively.\n\n6 For ReCoRD, we predict the entity that has the highest F1 with the other entity options.\n\nHuman Performance Pilehvar and Camacho-Collados (2019), Khashabi et al. (2018), Nangia and\nBowman (2019), and Zhang et al. (2018) respectively provide estimates for human performance\non WiC, MultiRC, RTE, and ReCoRD. For the remaining tasks, including the diagnostic set, we\nestimate human performance by hiring crowdworker annotators through Amazon\u2019s Mechanical Turk\nplatform to reannotate a sample of each test set. We follow a two-step procedure where a\ncrowdworker completes a short training phase before proceeding to the annotation phase, modeled after the\nmethod used by Nangia and Bowman (2019) for GLUE. See Appendix C for details.\n\n5.2 Results\n\nTable 3 shows results for all baselines. 
The most frequent class and CBoW baselines do not perform\nwell overall, achieving near chance performance for several of the tasks. Using BERT increases\nthe average SuperGLUE score by 25 points, attaining significant gains on nearly all of the benchmark\ntasks, particularly MultiRC, ReCoRD, and RTE. On WSC, BERT actually performs worse than\nthe simple baselines, likely due to the small size of the dataset and the lack of data augmentation.\nUsing MultiNLI as an additional source of supervision for BoolQ, CB, and RTE leads to a 2-5 point\nimprovement on each of these tasks. Using SWAG as a transfer task for COPA yields an 8 point improvement.\nOur best baselines still lag substantially behind human performance. On average, there is a nearly 20\npoint gap between BERT++ and human performance. The largest gap is on WSC, with a 35 point\ndifference between the best model and human performance. The smallest margins are on BoolQ,\nCB, RTE, and WiC, with gaps of around 10 points on each of these. We believe these gaps will be\nchallenging to close: On WSC and COPA, human performance is perfect. On three other tasks, it is\nin the mid-to-high 90s. On the diagnostics, all models continue to lag significantly behind humans.\nThough all models obtain near perfect gender parity scores on Winogender, this is because their\naccuracy is near that of random guessing.\n\n6 Conclusion\n\nWe present SuperGLUE, a new benchmark for evaluating general-purpose language understanding\nsystems. SuperGLUE updates the GLUE benchmark by identifying a new set of challenging NLU\ntasks, as measured by the difference between human and machine baselines. 
The set of eight tasks in\nour benchmark emphasizes diverse task formats and low-data training tasks, with nearly half the\ntasks having fewer than 1k examples and all but one of the tasks having fewer than 10k examples.\nWe evaluate BERT-based baselines and find that they still lag behind humans by nearly 20 points.\nGiven the difficulty of SuperGLUE for BERT, we expect that further progress in multi-task, transfer,\nand unsupervised/self-supervised learning techniques will be necessary to approach human-level\nperformance on the benchmark. Overall, we argue that SuperGLUE offers a rich and challenging testbed\nfor work developing new general-purpose machine learning methods for language understanding.\n\n7 Acknowledgments\n\nWe thank the original authors of the included datasets in SuperGLUE for their cooperation in the\ncreation of the benchmark, as well as those who proposed tasks and datasets that we ultimately\ncould not include. This work was made possible in part by a donation to NYU from Eric and Wendy\nSchmidt made by recommendation of the Schmidt Futures program. We gratefully acknowledge\nthe support of the NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this\nresearch, and funding from DeepMind for the hosting of the benchmark platform. AW is supported\nby the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE\n1342536. Any opinions, findings, and conclusions or recommendations expressed in this material are\nthose of the author(s) and do not necessarily reflect the views of the National Science Foundation. SB\nis partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning:\nfrom Pattern Recognition to AI) and Samsung Electronics (Improving Deep Learning using Latent\nStructure).\n\nReferences\n\nStephen H. 
Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik\nSen, Alexander Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher R\u00e9, and\nRob Malkin. Snorkel drybell: A case study in deploying weak supervision at industrial scale. In\nSIGMOD. ACM, 2018.\n\nRoy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and\nIdan Szpektor. The second PASCAL recognising textual entailment challenge. In Proceedings\nof the Second PASCAL Challenges Workshop on Recognising Textual Entailment, 2006. URL\nhttp://u.cs.biu.ac.il/~nlp/RTE2/Proceedings/01.pdf.\n\nLuisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The\nfifth PASCAL recognizing textual entailment challenge. In Textual Analysis Conference (TAC),\n2009. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.1231.\n\nSven Buechel, Anneke Buffone, Barry Slaff, Lyle Ungar, and Jo\u00e3o Sedoc. Modeling empathy and\ndistress in reaction to news stories. In Proceedings of the 2018 Conference on Empirical Methods\nin Natural Language Processing (EMNLP), 2018.\n\nChris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine\ntranslation research. In Proceedings of the Conference of the European Chapter of the Association\nfor Computational Linguistics (EACL). Association for Computational Linguistics, 2006. URL\nhttps://www.aclweb.org/anthology/E06-1032.\n\nDaniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task\n1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings\nof the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for\nComputational Linguistics, 2017. doi: 10.18653/v1/S17-2001. URL\nhttps://www.aclweb.org/anthology/S17-2001.\n\nEunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke\nZettlemoyer. 
QuAC: Question answering in context. In Proceedings of the 2018 Conference on\nEmpirical Methods in Natural Language Processing (EMNLP). Association for Computational\nLinguistics, 2018a.\n\nEunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. Ultra-fine entity typing. In Proceedings\nof the Association for Computational Linguistics (ACL). Association for Computational Linguistics,\n2018b. URL https://www.aclweb.org/anthology/P18-1009.\n\nChristopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina\nToutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings\nof the 2019 Conference of the North American Chapter of the Association for Computational\nLinguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924\u20132936,\n2019a.\n\nKevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le.\nBAM! Born-again multi-task networks for natural language understanding. In Proceedings of\nthe Association of Computational Linguistics (ACL). Association for Computational Linguistics,\n2019b. URL https://arxiv.org/pdf/1907.04829.pdf.\n\nRonan Collobert and Jason Weston. A unified architecture for natural language processing: Deep\nneural networks with multitask learning. In Proceedings of the 25th International Conference on\nMachine Learning (ICML). Association for Computing Machinery, 2008. URL\nhttps://dl.acm.org/citation.cfm?id=1390177.\n\nAlexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence\nrepresentations. In Proceedings of the 11th Language Resources and Evaluation Conference. European\nLanguage Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.\n\nAlexis Conneau, Douwe Kiela, Holger Schwenk, Lo\u00efc Barrault, and Antoine Bordes. Supervised\nlearning of universal sentence representations from natural language inference data. 
In\nProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing\n(EMNLP). Association for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1070. URL\nhttps://www.aclweb.org/anthology/D17-1070.\n\nIdo Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment\nchallenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object\nClassification, and Recognising Textual Entailment. Springer, 2006. URL\nhttps://link.springer.com/chapter/10.1007/11736790_9.\n\nAndrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in Neural\nInformation Processing Systems (NeurIPS). Curran Associates, Inc., 2015. URL\nhttp://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf.\n\nMarie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank:\nInvestigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und\nBedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.\n\nJacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep\nbidirectional transformers for language understanding. In Proceedings of the Conference of the\nNorth American Chapter of the Association for Computational Linguistics: Human Language\nTechnologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL\nhttps://arxiv.org/abs/1810.04805.\n\nWilliam B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases.\nIn Proceedings of IWP, 2005.\n\nManaal Faruqui and Dipanjan Das. Identifying well-formed natural language questions. In\nProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).\nAssociation for Computational Linguistics, 2018. 
URL https://www.aclweb.org/anthology/D18-1091.\n\nTommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar.\nBorn again neural networks. International Conference on Machine Learning (ICML), 2018. URL\nhttp://proceedings.mlr.press/v80/furlanello18a.html.\n\nMatt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew\nPeters, Michael Schmitz, and Luke S. Zettlemoyer. AllenNLP: A deep semantic natural language\nprocessing platform. In Proceedings of Workshop for NLP Open Source Software, 2017. URL\nhttps://www.aclweb.org/anthology/W18-2501.\n\nDanilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing\ntextual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment\nand Paraphrasing. Association for Computational Linguistics, 2007.\n\nHila Gonen and Yoav Goldberg. Lipstick on a pig: Debiasing methods cover up systematic\ngender biases in word embeddings but do not remove them. In Proceedings of the 2019\nConference of the North American Chapter of the Association for Computational Linguistics:\nHuman Language Technologies, Volume 1 (Long and Short Papers), pages 609\u2013614,\nMinneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL\nhttps://www.aclweb.org/anthology/N19-1061.\n\nFelix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences\nfrom unlabelled data. In Proceedings of the Conference of the North American Chapter of\nthe Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).\nAssociation for Computational Linguistics, 2016. doi: 10.18653/v1/N16-1162. URL\nhttps://www.aclweb.org/anthology/N16-1162.\n\nGeoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv\npreprint 1503.02531, 2015. URL https://arxiv.org/abs/1503.02531.\n\nRobin Jia and Percy Liang. 
Adversarial examples for evaluating reading comprehension systems. In\nProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).\nAssociation for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1215. URL\nhttps://www.aclweb.org/anthology/D17-1215.\n\nDaniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking\nbeyond the surface: A challenge set for reading comprehension over multiple sentences. In\nProceedings of the Conference of the North American Chapter of the Association for Computational\nLinguistics: Human Language Technologies (NAACL-HLT). Association for Computational\nLinguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.\n\nDiederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n1412.6980, 2014. URL https://arxiv.org/abs/1412.6980.\n\nRyan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba,\nand Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems,\n2015.\n\nNikita Kitaev, Steven Cao, and Dan Klein. Multilingual constituency parsing with self-attention and\npre-training. In Proceedings of the 57th Annual Meeting of the Association for Computational\nLinguistics, pages 3499\u20133505, Florence, Italy, July 2019. Association for Computational Linguistics.\ndoi: 10.18653/v1/P19-1340. URL https://www.aclweb.org/anthology/P19-1340.\n\nVid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz.\nA surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th\nAnnual Meeting of the Association for Computational Linguistics, pages 4837\u20134842, Florence,\nItaly, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1478. URL\nhttps://www.aclweb.org/anthology/P19-1478.\n\nKenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 
End-to-end neural coreference\nresolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language\nProcessing. Association for Computational Linguistics, September 2017. doi:\n10.18653/v1/D17-1018. URL https://www.aclweb.org/anthology/D17-1018.\n\nHector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In\nThirteenth International Conference on the Principles of Knowledge Representation and Reasoning,\n2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.\n\nChia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau.\nHow not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics\nfor dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in\nNatural Language Processing. Association for Computational Linguistics, 2016. doi:\n10.18653/v1/D16-1230. URL https://www.aclweb.org/anthology/D16-1230.\n\nNelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic\nknowledge and transferability of contextual representations. In Proceedings of the Conference of\nthe North American Chapter of the Association for Computational Linguistics: Human Language\nTechnologies (NAACL-HLT). Association for Computational Linguistics, 2019a. URL\nhttps://arxiv.org/abs/1903.08855.\n\nNelson F. Liu, Roy Schwartz, and Noah A. Smith. Inoculation by fine-tuning: A method for\nanalyzing challenge datasets. In Proceedings of the Conference of the North American Chapter\nof the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).\nAssociation for Computational Linguistics, 2019b. URL https://arxiv.org/abs/1904.02668.\n\nXiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural\nnetworks via knowledge distillation for natural language understanding. arXiv preprint 1904.09482,\n2019c. 
URL http://arxiv.org/abs/1904.09482.\n\nXiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for\nnatural language understanding. arXiv preprint 1901.11504, 2019d.\n\nKaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. Gender bias in\nneural natural language processing. arXiv preprint 1807.11714, 2018. URL\nhttp://arxiv.org/abs/1807.11714.\n\nBryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation:\nContextualized word vectors. In Advances in Neural Information Processing Systems\n(NeurIPS). Curran Associates, Inc., 2017. URL\nhttp://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.\n\nBryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language\ndecathlon: Multitask learning as question answering. arXiv preprint 1806.08730, 2018. URL\nhttps://arxiv.org/abs/1806.08730.\n\nR. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic\nheuristics in natural language inference. In Proceedings of the Association for Computational\nLinguistics (ACL). Association for Computational Linguistics, 2019. URL\nhttps://arxiv.org/abs/1902.01007.\n\nRichard T. McCoy and Tal Linzen. Non-entailed subsequences as a challenge for natural language\ninference. In Proceedings of the Society for Computation in Linguistics (SCiL) 2019, 2019. URL\nhttps://scholarworks.umass.edu/scil/vol2/iss1/46/.\n\nGeorge A Miller. WordNet: a lexical database for English. Communications of the ACM, 1995. URL\nhttps://www.aclweb.org/anthology/H94-1111.\n\nAakanksha Naik, Abhilasha Ravichander, Norman M. Sadeh, Carolyn Penstein Ros\u00e9, and Graham\nNeubig. Stress test evaluation for natural language inference. In International Conference on\nComputational Linguistics (COLING), 2018.\n\nNikita Nangia and Samuel R. Bowman. Human vs. 
Muppet: A conservative estimate of human\nperformance on the GLUE benchmark. In Proceedings of the Association of Computational\nLinguistics (ACL). Association for Computational Linguistics, 2019. URL\nhttps://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.\n\nAdam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,\nZeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in\nPyTorch. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates,\nInc., 2017. URL https://openreview.net/pdf?id=BJJsrmfCZ.\n\nJeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word\nrepresentation. In Proceedings of the Conference on Empirical Methods in Natural Language\nProcessing (EMNLP). Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1162.\nURL https://www.aclweb.org/anthology/D14-1162.\n\nMatthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and\nLuke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of\nthe North American Chapter of the Association for Computational Linguistics: Human Language\nTechnologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi:\n10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202.\n\nJason Phang, Thibault F\u00e9vry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary\ntraining on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL\nhttps://arxiv.org/abs/1811.01088.\n\nMohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for\nevaluating context-sensitive meaning representations. In Proceedings of the Conference of the\nNorth American Chapter of the Association for Computational Linguistics: Human Language\nTechnologies (NAACL-HLT). Association for Computational Linguistics, 2019. 
URL https://arxiv.org/abs/1808.09121.\n\nAdam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White,\nand Benjamin Van Durme. Collecting diverse natural language inference problems for sentence\nrepresentation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in\nNatural Language Processing. Association for Computational Linguistics, 2018. URL\nhttps://www.aclweb.org/anthology/D18-1007.\n\nAlec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language\nunderstanding by generative pre-training, 2018. Unpublished ms. available through a link at\nhttps://blog.openai.com/language-unsupervised/.\n\nPranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions\nfor machine comprehension of text. In Proceedings of the Conference on Empirical Methods in\nNatural Language Processing (EMNLP). Association for Computational Linguistics, 2016. doi:\n10.18653/v1/D16-1264. URL http://aclweb.org/anthology/D16-1264.\n\nMelissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives:\nAn evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.\n\nRachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in\ncoreference resolution. In Proceedings of the 2018 Conference of the North American Chapter\nof the Association for Computational Linguistics: Human Language Technologies. Association\nfor Computational Linguistics, 2018. doi: 10.18653/v1/N18-2002. URL\nhttps://www.aclweb.org/anthology/N18-2002.\n\nMaarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. SocialIQa: Commonsense\nreasoning about social interactions. Proceedings of the Conference on Empirical Methods in\nNatural Language Processing (EMNLP), 2019. URL https://arxiv.org/abs/1904.09728.\n\nNathan Schneider and Noah A Smith. A corpus and model integrating multiword expressions and\nsupersenses. 
In Proceedings of the Conference of the North American Chapter of the Association\nfor Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for\nComputational Linguistics, 2015. URL https://www.aclweb.org/anthology/N15-1177.\n\nKarin Kipper Schuler. Verbnet: A Broad-coverage, Comprehensive Verb Lexicon. PhD thesis, 2005.\nURL http://verbs.colorado.edu/~kipper/Papers/dissertation.pdf.\n\nRichard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng,\nand Christopher Potts. Recursive deep models for semantic compositionality over a sentiment\ntreebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing\n(EMNLP). Association for Computational Linguistics, 2013. URL\nhttps://www.aclweb.org/anthology/D13-1170.\n\nIan Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim,\nBenjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. What do you learn from\ncontext? Probing for sentence structure in contextualized word representations. International\nConference on Learning Representations (ICLR), 2019. URL\nhttps://openreview.net/forum?id=SJzSgnRcKX.\n\nHarsh Trivedi, Heeyoung Kwon, Tushar Khot, Ashish Sabharwal, and Niranjan Balasubramanian.\nRepurposing entailment for multi-hop question answering tasks. In Proceedings of the 2019\nConference of the North American Chapter of the Association for Computational Linguistics:\nHuman Language Technologies, Volume 1 (Long and Short Papers), pages 2948\u20132958, Minneapolis,\nMinnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1302.\nURL https://www.aclweb.org/anthology/N19-1302.\n\nAlex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.\nGLUE: A multi-task benchmark and analysis platform for natural language understanding. In\nInternational Conference on Learning Representations, 2019a. 
URL https://openreview.net/forum?id=rJ4km2R5t7.\n\nAlex Wang, Ian F. Tenney, Yada Pruksachatkun, Katherin Yu, Jan Hula, Patrick Xia, Raghu\nPappagari, Shuning Jin, R. Thomas McCoy, Roma Patel, Yinghui Huang, Jason Phang, Edouard\nGrave, Haokun Liu, Najoung Kim, Phu Mon Htut, Thibault F\u00e9vry, Berlin Chen, Nikita Nangia,\nAnhad Mohananey, Katharina Kann, Shikha Bordia, Nicolas Patry, David Benton, Ellie Pavlick,\nand Samuel R. Bowman. jiant 1.2: A software toolkit for research on general-purpose text\nunderstanding models. http://jiant.info/, 2019b.\n\nAlex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments.\nTransactions of the Association for Computational Linguistics (TACL), 2019. URL\nhttps://arxiv.org/abs/1805.12471.\n\nKellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. Mind the GAP: A balanced\ncorpus of gendered ambiguous pronouns. Transactions of the Association for Computational\nLinguistics (TACL), 2018. URL https://www.aclweb.org/anthology/Q18-1042.\n\nAdina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for\nsentence understanding through inference. In Proceedings of the Conference of the North American\nChapter of the Association for Computational Linguistics: Human Language Technologies\n(NAACL-HLT). Association for Computational Linguistics, 2018. URL\nhttp://aclweb.org/anthology/N18-1101.\n\nZhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.\nXLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural\nInformation Processing Systems (NeurIPS), 2019.\n\nFabio Massimo Zanzotto and Lorenzo Ferrone. Have you lost the thread? Discovering ongoing\nconversations in scattered dialog blocks. ACM Transactions on Interactive Intelligent Systems\n(TiiS), 2017.\n\nRowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset\nfor grounded commonsense inference. 
2018. URL https://www.aclweb.org/anthology/D18-1009.\n\nSheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme.\nReCoRD: Bridging the gap between human and machine commonsense reading comprehension.\narXiv preprint 1810.12885, 2018.\n\nYuan Zhang, Jason Baldridge, and Luheng He. PAWS: Paraphrase adversaries from word scrambling.\nProceedings of the Conference of the North American Chapter of the Association for Computational\nLinguistics: Human Language Technologies (NAACL-HLT), 2019. URL\nhttps://arxiv.org/abs/1904.01130.\n\nJieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in\ncoreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference\nof the North American Chapter of the Association for Computational Linguistics: Human Language\nTechnologies. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-2003. URL\nhttps://www.aclweb.org/anthology/N18-2003.\n", "award": [], "sourceid": 1828, "authors": [{"given_name": "Alex", "family_name": "Wang", "institution": "New York University"}, {"given_name": "Yada", "family_name": "Pruksachatkun", "institution": "New York University"}, {"given_name": "Nikita", "family_name": "Nangia", "institution": "NYU"}, {"given_name": "Amanpreet", "family_name": "Singh", "institution": "Facebook"}, {"given_name": "Julian", "family_name": "Michael", "institution": "University of Washington"}, {"given_name": "Felix", "family_name": "Hill", "institution": "Google Deepmind"}, {"given_name": "Omer", "family_name": "Levy", "institution": "Facebook AI Research"}, {"given_name": "Samuel", "family_name": "Bowman", "institution": "New York University"}]}