{"title": "Interpreting and improving natural-language processing (in machines) with  natural language-processing (in the brain)", "book": "Advances in Neural Information Processing Systems", "page_first": 14954, "page_last": 14964, "abstract": "Neural networks models for NLP are typically implemented without the explicit encoding of language rules and yet they are able to break one performance record after another.  This has generated a lot of research interest in interpreting the representations learned by these networks. We propose here a novel interpretation approach that relies on the only processing system we have that does understand language: the human brain. We use brain imaging recordings of subjects reading complex natural text to interpret word and sequence embeddings from 4 recent NLP models - ELMo, USE, BERT and Transformer-XL. We study how their representations differ across layer depth, context length, and attention type. Our results reveal differences in the context-related representations across these models. Further, in the transformer models, we find an interaction between layer depth and context length, and between layer depth and attention type. We finally hypothesize that altering BERT to better align with brain recordings would enable it to also better understand language. Probing the altered BERT using syntactic NLP tasks reveals that the model with increased brain-alignment outperforms the original model. 
Cognitive neuroscientists have already begun using NLP networks to study the brain, and this work closes the loop to allow the interaction between NLP and cognitive neuroscience to be a true cross-pollination.", "full_text": "Interpreting and improving natural-language\n\nprocessing (in machines) with natural\n\nlanguage-processing (in the brain)\n\nMariya Toneva\n\nNeuroscience Institute\n\nDepartment of Machine Learning\n\nCarnegie Mellon University\n\nmariya@cmu.edu\n\nLeila Wehbe\n\nNeuroscience Institute\n\nDepartment of Machine Learning\n\nCarnegie Mellon University\n\nlwehbe@cmu.edu\n\nAbstract\n\nNeural networks models for NLP are typically implemented without the explicit\nencoding of language rules and yet they are able to break one performance record\nafter another. This has generated a lot of research interest in interpreting the\nrepresentations learned by these networks. We propose here a novel interpretation\napproach that relies on the only processing system we have that does understand\nlanguage: the human brain. We use brain imaging recordings of subjects reading\ncomplex natural text to interpret word and sequence embeddings from 4 recent\nNLP models - ELMo, USE, BERT and Transformer-XL. We study how their\nrepresentations differ across layer depth, context length, and attention type. Our\nresults reveal differences in the context-related representations across these models.\nFurther, in the transformer models, we \ufb01nd an interaction between layer depth and\ncontext length, and between layer depth and attention type. We \ufb01nally hypothesize\nthat altering BERT to better align with brain recordings would enable it to also\nbetter understand language. Probing the altered BERT using syntactic NLP tasks\nreveals that the model with increased brain-alignment outperforms the original\nmodel. 
Cognitive neuroscientists have already begun using NLP networks to study\nthe brain, and this work closes the loop to allow the interaction between NLP and\ncognitive neuroscience to be a true cross-pollination.\n\nCode available at https://github.com/mtoneva/brain_language_nlp\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nThe large success of deep neural networks in NLP is perplexing when considering that unlike most\nother NLP approaches, neural networks are typically not informed by explicit language rules. Yet,\nneural networks are constantly breaking records in various NLP tasks from machine translation to\nsentiment analysis. Even more interestingly, it has been shown that word embeddings and language\nmodels trained on a large generic corpus and then optimized for downstream NLP tasks produce\neven better results than training the entire model only to solve this one task (Peters et al., 2018;\nHoward and Ruder, 2018; Devlin et al., 2018). These models seem to capture something generic\nabout language. What representations do these models capture of their language input?\n\nFigure 1: Diagram of approach and prior on brain function. The prior was constructed using the\nresults of Lerner et al. (2011): regions in group 1 (white) process information related to isolated\nwords and word sequences while group 2 (red) process only information related to word sequences\n(see Section 1.1). V indicates visual cortex. The drawing indicates the views of the brain with respect\nto the head. See supplementary materials for names of brain areas and full description of the methods.\n\nDifferent approaches have been proposed to probe the representations in the network layers through\nNLP tasks designed to detect specific linguistic information (Conneau et al., 2018; Zhu et al., 2018;\nLinzen et al., 2016). 
Other approaches have attempted to offer a more theoretical assessment of how\nrecurrent networks propagate information, or what word embeddings can represent (Peng et al., 2018;\nChen et al., 2017; Weiss et al., 2018). Most of this work has been centered around understanding the\nproperties of sequential models such as LSTMs and RNNs, with considerably less work focused on\nnon-sequential models such as transformers.\nUsing speci\ufb01c NLP tasks, word annotations or behavioral measures to detect if a type of information\nis present in a network-derived representation (such as a word embedding of an LSTM or a state\nvector of a transformer) can be informative. However, complex and arguably more interesting aspects\nof language, such as high level meaning, are dif\ufb01cult to capture in an NLP task or in behavioral\nmeasures. We therefore propose a novel approach for interpreting neural networks that relies on\nthe only processing system we have that does understand language: the human brain. Indeed, the\nbrain does represent complex linguistic information while processing language, and we can use brain\nactivity recordings as a proxy for these representations. We can then relate the brain representations\nwith neural network representations by learning a mapping from the latter to the former. We refer to\nthis analysis as aligning the neural network representations with brain activity.\n\n1.1 Proposed approach\n\nWe propose to look at brain activity of subjects reading naturalistic text as a source of additional\ninformation for interpreting neural networks. We use fMRI (functional Magnetic Resonance Imaging)\nand Magnetoencephalography (MEG) recordings of the brain activity of these subjects as they are\npresented text one word at a time. We present the same text to the NLP model we would like\nto investigate and extract representations from the intermediate layers of the network, given this\ntext. 
We then learn an alignment between these extracted representations and the brain recordings\ncorresponding to the same words to offer an evaluation of the information contained in the network\nrepresentations. Evaluating neural network representations with brain activity is a departure from\nexisting studies that go the other way, using such an alignment to instead evaluate brain representations\n(Wehbe et al., 2014a; Frank et al., 2015; Hale et al., 2018; Jain and Huth, 2018).\nTo align a layer \u2113 representation with brain activity, we first learn a model that predicts the fMRI\nor MEG activity in every region of the brain (fig. 1). We determine the regions where this model\nis predictive of brain activity using a classification task followed by a significance test. If a layer\nrepresentation can accurately predict the activity in a brain region r, then we conclude that the layer\nshares information with brain region r. We can thus make conclusions about the representation in\nlayer \u2113 based on our prior knowledge of region r.\nBrain recordings have inherent, meaningful structure that is absent in network-derived representations.\nIn the brain, different processes are assigned to specific locations as has been revealed by a large array\nof fMRI experiments. These processes have specific latencies and follow a certain order, which has\nbeen revealed by electrophysiology methods such as MEG. In contrast to the brain, a network-derived\nrepresentation might encode information that is related to multiple of these processes without a\nspecific organization. When we align that specific network representation with fMRI and MEG\ndata, the result will be a decomposition of the representation into parts that correspond to different\nprocesses and should therefore be more interpretable. We can think of alignment with brain activity\nas a \u201cdemultiplexer\u201d in which a single input (the network-derived representation) is decomposed into\nmultiple outputs (relationship with different brain processes).\n\n[Figure 1 diagram text] Prior used in this paper. Procedure for interpreting neural network representations:\nStep 1: Acquire brain activity of people reading or listening to natural text.\nStep 2: Acquire intermediate representations x_\u2113 of layer \u2113 of a network processing the same text.\nStep 3: For each brain region, learn a function y = f(x_\u2113) that predicts activity in that region using the layer representation of the corresponding words.\nStep 4: Use a classification task to determine if the model is predictive of brain activity. Compute significance of classification results and identify the regions predicted by layer \u2113.\nStep 5: Interpret layer \u2113 using the prediction result and a prior on representations in brain areas.\n\nThere doesn\u2019t yet exist a unique theory of how the brain processes language that researchers agree\nupon (Hickok and Poeppel, 2007; Friederici, 2011; Hagoort, 2003). Because we don\u2019t know which of\nthe existing theories are correct, we abandon the theory-based approach and adopt a fully data-driven\napproach. We focus on results from experiments that use naturalistic stimuli to derive our priors\non the function of specific brain areas during language processing. These experiments have found\nthat a set of regions in the temporo-parietal and frontal cortices are activated in language processing\n(Lerner et al., 2011; Wehbe et al., 2014b; Huth et al., 2016; Blank and Fedorenko, 2017) and are\ncollectively referred to as the language network (Fedorenko and Thompson-Schill, 2014). Using\nthe results of Lerner et al. 
(2011) we subdivide this network into two groups of areas: group 1 is\nconsistently activated across subjects when they listen to disconnected words or to complex fragments\nlike sentences or paragraphs and group 2 is consistently activated only when they listen to complex\nfragments. We will use group 1 as our prior on brain areas that process information at the level of\nboth short-range context (isolated words) and long-range context (multi-word composition), and\ngroup 2 as a prior on areas that process long-range context only. Fig. 1 shows a simple approximation\nof these areas on the Montreal Neurological Institute (MNI) template. Inspection of the results of\nJain and Huth (2018) shows they corroborate the division of language areas into group 1 and group 2.\nBecause our prior relies on experimental results and not theories of brain function, it is data-driven.\nWe use this setup to investigate a series of questions about the information represented in different\nlayers of neural network models. We explore four recent models: ELMo, a language model by Peters\net al. (2018), BERT, a transformer by Devlin et al. (2018), USE (Universal Sentence Encoder), a\nsentence encoder by Cer et al. (2018), and T-XL (Transformer-XL), a transformer that includes a\nrecurrence mechanism by Dai et al. (2019). We investigate multiple questions about these networks.\nIs word-level speci\ufb01c information represented only at input layers? Does this differ across recurrent\nmodels, transformers and other sentence embedding methods? How many layers do we need to\nrepresent a speci\ufb01c length of context? Is attention affecting long range or short range context?\n\nIntricacies As a disclaimer, we warn the reader that one should be careful while dealing with\nbrain activity. Say a researcher runs a task T in fMRI (e.g. counting objects on the screen) and\n\ufb01nds it activates region R, which is shown in another experiment to also be active during process\nP (e.g. 
internal speech). It is seductive to then infer that process P is involved during task T. This\n\u201creverse inference\u201d can lead to erroneous conclusions, as region R can be involved in more than\none task (Poldrack, 2006). To avoid this trap, we only interpret alignment between network-derived\nrepresentations and brain regions if (1) the function of the region is well studied and we have some\nconfidence in its function during a task similar to ours (e.g. the primary visual cortex processing\nletters on the screen or group 2 processing long range context) or (2) we show a brain region has\noverlap in the variance explained by the network-derived layer and by a specific process, in the same\nexperiment. We further take sound measures for reporting results: we cross-validate our models and\nreport results on unseen test sets. Another possible fallacy is to directly compare the performance\nof layers from different networks and conclude that one network performs better than the other:\ninformation is likely organized differently across networks and such comparisons are misleading.\nInstead we only perform controlled experiments where we look at one network and vary one parameter\nat a time, such as context length, layer depth or attention type.\n\n1.2 Contributions\n\n1. We present a new method to interpret network representations and a proof of concept for it.\n2. We use our method to analyze and provide hypotheses about ELMo, BERT, USE and T-XL.\n3. We find the middle layers of transformers are better at predicting brain activity than other\nlayers. We find that T-XL\u2019s performance doesn\u2019t degrade as context is increased, unlike the\nother models\u2019. We find that using uniform attention in early layers of BERT (removing the\npretrained attention on the previous layer) leads to better prediction of brain activity.\n4. 
We show that when BERT is altered to better align with brain recordings (by removing the\npretrained attention in the shallow layers), it is also able to perform better at NLP tasks that\nprobe its syntactic understanding (Marvin and Linzen, 2018). This result shows a transfer of\nknowledge from the brain to NLP tasks and validates our approach.\n\n2 Related work on brains and language\n\nMost work investigating language in the brain has been done in a controlled experiment setup where\ntwo conditions are contrasted (Friederici, 2011). These conditions typically vary in complexity\n(simple vs. complex sentences), vary in the presence or absence of a linguistic property (sentences vs.\nlists of words) or vary in the presence or absence of incongruities (e.g. semantic surprisal) (Friederici,\n2011). A few researchers instead use naturalistic stimuli such as stories (Brennan et al., 2010;\nLerner et al., 2011; Speer et al., 2009; Wehbe et al., 2014b; Huth et al., 2016; Blank and Fedorenko,\n2017). Some use predictive models of brain activity as a function of multi-dimensional feature\nspaces describing the different properties of the stimulus (Wehbe et al., 2014b; Huth et al., 2016).\nA few previous works have used neural network representations as a source of feature spaces to model\nbrain activity. Wehbe et al. (2014b) aligned the MEG brain activity we use here with a Recurrent\nNeural Network (RNN), trained on an online archive of Harry Potter Fan Fiction. The authors aligned\nbrain activity with the context vector and the word embedding, allowing them to trace sentence\ncomprehension at a word-by-word level. Jain and Huth (2018) aligned layers from a Long Short-Term\nMemory (LSTM) model to fMRI recordings of subjects listening to stories to differentiate between\nthe amount of context maintained by each brain region. 
Other approaches rely on computing surprisal\nor cognitive load metrics using neural networks to identify processing effort in the brain, instead of\naligning entire representations (Frank et al., 2015; Hale et al., 2018).\nThere is little prior work that evaluates or improves NLP models through brain recordings. S\u00f8gaard\n(2016) proposes to evaluate whether a word embedding contains cognition-relevant semantics by\nmeasuring how well it predicts eye tracking data and fMRI recordings. Fyshe et al. (2014) build a\nnon-negative sparse embedding for individual words by constraining the embedding to also predict\nbrain activity well and show that the new embeddings better align with behavioral measures of\nsemantics.\n\n3 Approach\n\nNetwork-derived Representations The approach we propose in this paper is general and can be\napplied to a wide variety of current NLP models. We present four case studies of recent models that\nhave very good performance on downstream tasks: ELMo, BERT, USE and T-XL.\n\n\u2022 ELMo is a bidirectional language model that incorporates multiple layers of LSTMs. It can\nbe used to derive contextualized embeddings by concatenating the LSTM output layers at\nthat word with its non-contextualized embedding. We use a pretrained version of ELMo\nwith 2 LSTM layers provided by Gardner et al. (2017).\n\u2022 BERT is a bidirectional model of stacked transformers that is trained to predict whether\na given sentence follows the current sentence, in addition to predicting a number of input\nwords that have been masked (Devlin et al., 2018). Upon release, this recent model achieved\nstate of the art across a large array of NLP tasks, ranging from question answering to named\nentity recognition. 
We use a pretrained model provided by Hugging Face [1]. We investigate\nthe base BERT model, which has 12 layers, 12 attention heads, and 768 hidden units.\n\u2022 USE is a method of encoding sentences into an embedding (Cer et al., 2018) using a task\nsimilar to Skip-thought (Kiros et al., 2015). USE is able to produce embeddings in the same\nspace for single words and passages of text of different lengths. We use a version of USE\nfrom TensorFlow Hub trained with a deep averaging network [2] that has 512 dimensions.\n\u2022 T-XL incorporates segment level recurrence into a transformer with the goal of capturing\nlonger context than either recurrent networks or usual transformers (Dai et al., 2019). We\nuse a pretrained model provided by Hugging Face [1], with 19 layers and 1024 hidden units.\n\n[1] https://github.com/huggingface/pytorch-pretrained-BERT/\n[2] https://tfhub.dev/google/universal-sentence-encoder/2\n\nFigure 2: Comparison between the prediction performance of two network representations from each\nmodel: a 10-word representation corresponding to the 10 most recent words shown to the participant\n(Red) and a word-embedding corresponding to the last word (Blue). Areas in white are well predicted\nfrom both representations. These results align to a fair extent with our prior: group 2 areas (red\noutlines) are mostly predicted by the longer context representations while areas 1b (lower white\noutlines) are predicted by both word-embeddings and longer context representations.\n\nWe investigate how the representations of all four networks change as we provide varying lengths\nof context. We compute the representations x_{\u2113,k} in each available intermediate layer (\u2113 \u2208 {1, 2} for\nELMo; \u2113 \u2208 {1, ..., 12} for BERT; \u2113 is the output embedding for USE; \u2113 \u2208 {1, ..., 19} for T-XL). 
We\ncompute x_{\u2113,k} for word w_n by passing the most recent k words (w_{n\u2212k+1}, ..., w_n) through the network.\n\nfMRI and MEG data In this paper we use fMRI and MEG data, which have complementary\nstrengths. fMRI is sensitive to the change in oxygen level in the blood that is a consequence of neural\nactivity; it has high spatial resolution (2-3 mm) and low temporal resolution (multiple seconds). MEG\nmeasures the change in the magnetic field outside the skull due to neural activity; it has low spatial\nresolution (multiple cm) and high temporal resolution (up to 1 kHz). We use fMRI data published by\nWehbe et al. (2014b). Eight subjects read chapter 9 of Harry Potter and the Sorcerer\u2019s Stone (Rowling,\n2012), which was presented one word at a time for a fixed duration of 0.5 seconds each, and 45\nminutes of data were recorded. The fMRI sampling rate (TR) was 2 seconds. The same chapter was\nshown by Wehbe et al. (2014a) to 3 subjects in MEG with the same rate of 0.5 seconds per word.\nDetails about the data and preprocessing can be found in the supplementary materials.\n\nEncoding models For each type of network-derived representation x_{\u2113,k}, we estimate an encoding\nmodel that takes x_{\u2113,k} as input and predicts the brain recording associated with reading the same\nk words that were used to derive x_{\u2113,k}. We estimate a function f, such that f(x_{\u2113,k}) = y, where y\nis the brain activity recorded with either MEG or fMRI. We follow previous work (Sudre et al.,\n2012; Wehbe et al., 2014b,a; Nishimoto et al., 2011; Huth et al., 2016) and model f as a linear\nfunction, regularized by the ridge penalty. The model is trained via four-fold cross-validation and the\nregularization parameter is chosen via nested cross-validation.\n\nEvaluation of predictions We evaluate the predictions from each encoding model by using them in\na classification task on held-out data, in the four-fold cross-validation setting. 
The classification task\nis to predict which of two sets of words was being read based on the respective feature representations\nof these words (Mitchell et al., 2008; Wehbe et al., 2014b,a). This task is performed between sets of\n20 consecutive TRs in fMRI (accounting for the slowness of the hemodynamic response), and sets\nof 20 randomly sampled words in MEG. The classification is repeated a large number of times and\nan average classification accuracy is obtained for each voxel in fMRI and for each sensor/timepoint\nin MEG. We refer to this accuracy of matching the predictions of an encoding model to the correct\nbrain recordings as \"prediction accuracy\". The final fMRI results are reported on the MNI template,\nand we use pycortex to visualize them (Gao et al., 2015). See the supplementary materials for more\ndetails about our methods.\n\n[Figure 2 image labels: panels ELMo, BERT, T-XL, USE; color scales from 0.4 to 0.7 for the 10-word representation and for the word-embedding]\n\nFigure 3: Amount of group 1b regions and group 2 regions predicted well by each network-derived\nrepresentation: a 10-word representation corresponding to the 10 most recent words shown to the\nparticipant (Red) and a word-embedding corresponding to the last word (Blue). White indicates\nthat both representations predict the specified amount of the regions well (about 0.7 threshold). We\npresent the mean and standard error of the percentage of explained voxels within the specified regions\nover all participants.\n\nProof of concept Since MEG signals are faster than the rate of word presentation, they are more\nappropriate to study the components of word embeddings than the slow fMRI signals that cannot be\nattributed to individual words. We know that a word embedding learned from a text corpus is likely to\ncontain information related to the number of letters and part of speech of a word. 
We show in section\n4 of the supplementary materials that the number of letters of a word and its ELMo embedding\npredict a shared portion of brain activity early on (starting 100ms after word onset) in the back of the\nMEG helmet, over the visual cortex. Indeed, this region and latency are when we expect the visual\ninformation related to a word to be processed (Sudre et al., 2012). Further, a word\u2019s part of speech\nand its ELMo embedding predict a shared portion of brain activity around 200ms after word onset in\nthe left front of the MEG sensor. Indeed, we know from electrophysiology studies that part of speech\nviolations incur a response around 200ms after word onset in the frontal lobe (Frank et al., 2015). We\nconclude from these experiments that the ELMo embedding contains information about the number\nof letters and the part of speech of a word. Since we knew this from the onset, this experiment serves\nas a proof of concept for using our approach to interpret information in network representations.\n\n4\n\nInterpreting long-range contextual representations\n\nIntegrated contextual information in ELMo, BERT, and T-XL One question of interest in NLP\nis how successfully a model is able to integrate context into its representations. We investigate whether\nthe four NLP models we consider are able to create an integrated representation of a text sequence\nby comparing the performance of encoding models trained with two kinds of representations: a\ntoken-level word-embedding corresponding to the most recent word token a participant was shown\nand a 10-word representation corresponding to the 10 most recent words. For each of the models\nwith multiple layers (all but USE), this 10-word representation was derived from a middle layer in\nthe network (layer 1 in ELMo, layer 7 in BERT, and layer 11 in T-XL). 
We present the qualitative\ncomparisons across the four models in figure 2, where only significantly predicted voxels for each\nof the 8 subjects were included with the false discovery rate controlled at level 0.05 (see section 3\nof supplementary materials for more details).\n\n[Figure 3 plot labels: panels ELMo, USE, BERT, T-XL; x-axis: 1b regions, group 2 regions; y-axis: percent of voxels (0 to 100); legend: word emb, context, both]\n\nFigure 4: Performance of encoding models for all hidden layers in ELMo, BERT, and T-XL as the\namount of context provided to the network is increased. Transformer-XL is the only model that\ncontinues to increase performance as the context length is increased. In all networks, the middle\nlayers perform the best for contexts longer than 15 words. The deepest layers across all networks\nshow a sharp increase in performance at short-range context (fewer than 10 words), followed by a\ndecrease in performance.\n\nWe provide a quantitative summary of the observed\ndifferences across models for the 1b regions and group 2 regions in Figure 3. We observe similarities\nin the word-embedding performances across all models, which all predict the brain activity in the left\nand right group 1b regions and to some extent in group 1a regions. We also observe differences in the\nlonger context representations between USE and the rest of the models:\n\n\u2022 ELMo, BERT, and T-XL long context representations predict subsets of both group 1 regions\nand group 2 regions. Most parts that are predicted by the word-embedding are also predicted\nby the long context representations (almost no blue voxels). We conclude that the long\ncontext representations most probably include information about the long range context and\nthe very recent word embeddings. 
These results may be due to the fact that all these models\nare at least partially trained to predict a word at a given position. They must encode long\nrange information and also local information that can predict the appropriate word.\n\u2022 USE long context representations predict the activity in a much smaller subset of group 2\nregions. The low performance of the USE vectors might be due to the deep averaging which\nmight be composing words in a crude manner. The low performance in predicting group 1\nregions is most probably because USE computes representations at a sentence level and\ndoes not have the option of retaining recent information like the other models. USE long\ncontext representations therefore only have long range information.\n\nRelationship between layer depth and context length We investigate how the performances of\nELMo, BERT, and T-XL change at different layers as they are provided varying size of contexts. The\nresults are shown in \ufb01gure 4. We observe that in all networks, the middle layers perform the best\nfor contexts longer than 15 words. In addition, the deepest layers across all networks show a sharp\nincrease in performance at short-range context (fewer than 10 words), followed by a decrease in\nperformance. We further observe that T-XL is the only model that continues to increase performance\nas the context length is increased. T-XL was designed to represent long range information better than\na usual transformer and our results suggest that it does. Finally, we observe that layer 1 in BERT\nbehaves differently from the \ufb01rst layers in the other two networks. In \ufb01gure 5, we show that when\nwe instead examine the increase in performance of all subsequent layers from the performance of\nthe \ufb01rst layer, the resulting context-layer relationships resemble the ones in T-XL. This suggests that\n\n7\n\n\fFigure 5: Change in encoding model per-\nformance of BERT layers from the per-\nformance of the \ufb01rst layer. 
When we ad-\njust for the performance of the \ufb01rst layer,\nthe performance of the remaining layers\nresemble that of T-XL more closely, as\nshown in Figure 4.\n\nFigure 6: Change in encoding model per-\nformance of BERT layer l when the atten-\ntion in layer l is made uniform. The perfor-\nmance of deep layers, other than the output\nlayer, is harmed by the change in attention.\nShallow layers bene\ufb01t from the uniform at-\ntention for context lengths up to 25 words.\n\nBERT layer 1 combines the information from the token-level embeddings in a way that limits the\nretention of longer context information in the layer 1 representations.\n\nEffect of attention on layer representation We further investigate the effect of attention across\ndifferent layers by measuring the negative impact that removing its learned attention has on its brain\nprediction performance. Speci\ufb01cally we replaced the learned attention with uniform attention over\nthe representations from the previous layer. More concretely, to alter the attention pattern at a single\nlayer in BERT, for each attention head hi = Attni(QW Q\ni ), we replace the pretrained\n\u221a\nparameter matrices W Q\nfor this layer, such that the attention Attn(Q, K, V ), de\ufb01ned\ni , W K\ndk)T V (Vaswani et al., 2017), yields equal probability over the values in value\nas sof tmax(QK/\nmatrix V (here dk denotes the dimensionality of the keys and queries). To this end, for a single\nlayer, we replace W Q\ni with the identity matrix. We only\nalter a single layer at a time, while keeping all other parameters of the pretrained BERT \ufb01xed. In\n\ufb01gure 6, we present the change in performance of each layer with uniform attention when compared\nto pretrained attention. The performance of deep layers, other than the output layer, is harmed by the\nchange in attention. 
However, surprisingly and against our expectations, shallow layers bene\ufb01t from\nthe uniform attention for context lengths up to 25 words.\n\ni with zero-\ufb01lled matrices and W V\n\ni and W K\n\ni , KW K\ni\n\n, V W V\n\n, and W V\ni\n\ni\n\n5 Applying insight from brain interpretations to NLP tasks\n\nAfter observing that the layers in the \ufb01rst half of the base BERT model bene\ufb01t from uniform\nattention for predicting brain activity, we test how the same alterations affect BERT\u2019s ability to predict\nlanguage by testing its performance on natural language processing tasks. We evaluate on tasks that\ndo not require \ufb01ne-tuning beyond pretraining to ensure that there is an opportunity to transfer the\ninsight from the brain interpretations of the pretrained BERT model. To this end, we evaluate on a\nrange of syntactic tasks proposed by Marvin and Linzen (2018), that have been previously used to\nquantify BERT\u2019s syntactic capabilities (Goldberg, 2019). These syntactic tasks measure subject-verb\nagreement in various types of sentences. They can be thought of as probe-tasks because they assess\nthe ability of the network to perform syntax-related predictions without further \ufb01ne-tuning.\nWe adopt the evaluation protocol of Goldberg (2019), in which BERT is \ufb01rst fed a complete\nsentence where the single focus verb is masked (e.g.[CLS] the game that the guard\nhates [MASK] bad .), then the prediction for the masked position is obtained using the pre-\ntrained language-modeling head, and lastly the accuracy is obtained by comparing the scores for\nis) to the score for the incorrect verb (i.e. the verb that is wrongly\nthe original correct verb (e.g.\nnumbered) (e.g. are). 
We make the attention in layers 1 through 6 in base BERT uniform, a single layer at a time while keeping the remaining parameters fixed as described in Section 4, and evaluate on the 13 tasks.\n\n[Plots: % change in acc from L1 versus context length, and % change in acc from learned attention versus context length, one curve per layer for layers 1 to 12.]\n\ncondition | uni L1 | uni L2 | uni L6 | uni L11 | base | count\nsimple | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 120\nin a sentential complement | 0.83 | 0.83 | 0.83 | 0.83 | 0.83 | 1440\nshort VP coordination | 0.88 | 0.90 | 0.91 | 0.88 | 0.89 | 720\nlong VP coordination | 0.96 | 0.97 | 1.00** | 0.96 | 0.98 | 400\nacross a prepositional phrase | 0.86 | 0.93** | 0.88 | 0.82 | 0.85 | 19440\nacross a subject relative clause | 0.83 | 0.83 | 0.85** | 0.83 | 0.84 | 9600\nacross an object relative clause | 0.87 | 0.91 | 0.92** | 0.86 | 0.89 | 19680\nacross an object relative clause (no that) | 0.87 | 0.80 | 0.87 | 0.84 | 0.86 | 19680\nin an object relative clause | 0.97** | 0.95 | 0.91 | 0.93 | 0.95 | 15960\nin an object relative clause (no that) | 0.83** | 0.72 | 0.74 | 0.72 | 0.79 | 15960\nreflexive anaphora: simple | 0.91 | 0.94 | 0.99** | 0.95 | 0.94 | 280\nreflexive anaphora: in a sent. complem. | 0.88 | 0.85 | 0.86 | 0.85 | 0.89 | 3360\nreflexive anaphora: across rel. clause | 0.79 | 0.84** | 0.79 | 0.76 | 0.80 | 22400\n\nTable 1: Performance of models with altered attention on subject-verb agreement across various sentence types (tasks by Marvin and Linzen (2018)). Best performance per task is made bold, and marked with ** when difference from \u2018base\u2019 performance is statistically significant. The altered models for the shallow layers significantly outperform the pretrained model (\u2018base\u2019) in 8 of the 13 tasks and achieve parity in 4 of the remaining 5 tasks.\n\n
We present the results of altering layers 1, 2, and 6 in Table 1. We observe that the altered models significantly outperform the pretrained model (\u2018base\u2019) in 8 of the 13 tasks and achieve parity in 4 of the remaining 5 tasks (paired t-test, significance level 0.01, FDR controlled for multiple comparisons (Benjamini and Hochberg, 1995)). Performance of altering layers 3-5 is similar and is presented in Supplementary Table 2. We contrast the performance of these layers with that of a model with uniform attention at layer 11, which is the model that suffers the most from this change for predicting the brain activity, as shown in Figure 6. We observe that this model also performs poorly on the NLP tasks, as it performs on par with or worse than the base model in 12 of the 13 tasks.\n\n6 Discussion\n\nWe introduced an approach to use brain activity recordings of subjects reading naturalistic text to interpret different representations derived from neural networks. As a proof of concept, we used MEG to show that the (non-contextualized) word embedding of ELMo contains information about word length and part of speech. We used fMRI to show that different network representations (for ELMo, USE, BERT, T-XL) encode information relevant to language processing at different context lengths. USE long-range context representations perform differently from those of the other models and do not also include short-range information. The transformer models (BERT and T-XL) both capture the most brain-relevant context information in their middle layers. T-XL, by combining both recurrent properties and transformer properties, has representations that don\u2019t degrade in performance when very long context is used, unlike purely recurrent models (e.g. ELMo) or transformers (e.g. BERT).\n\nWe found that uniform attention on the previous layer actually improved the brain prediction performance of the shallow layers (layers 1-6) over using learned attention. 
After this observation, we tested how the same alterations affect BERT\u2019s ability to predict language by probing the altered BERT\u2019s representations using syntactic NLP tasks. We observed that the altered BERT performs better on the majority of the tasks. This finding suggests that altering an NLP model to better align with brain recordings of people processing language may lead to better language understanding by the NLP model. This result parallels concurrent work by Kubilius et al. (2019) in the domain of vision, which shows that a neural network that is better aligned with brain activity and incorporates insights about modularity and connectivity in the brain outperforms other models with similar capacity on ImageNet.\n\nFuture work. We hope that as naturalistic brain experiments become more popular and data more widely shared, aligning brain activity with neural networks will become a research area. Our next steps are to expand the analysis using MEG to uncover new aspects of word embeddings and to derive more informative fMRI brain priors that contain specific conceptual information that is linked to brain areas, and use them to study the high-level semantic information in network representations.\n\nAcknowledgments\n\nWe thank Tom Mitchell for valuable discussions. We thank the National Science Foundation for supporting this work through the Graduate Research Fellowship under Grant No. DGE1745016, and Google for supporting this work through the Google Faculty Award.\n\nReferences\n\nBenjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289\u2013300.\n\nBlank, I. and Fedorenko, E. (2017). Domain-general brain regions do not track linguistic input as closely as language-selective regions. 
Journal of Neuroscience, pages 3642\u201316.\n\nBrennan, J., Nir, Y., Hasson, U., Malach, R., Heeger, D., and Pylkk\u00e4nen, L. (2010). Syntactic structure building in the anterior temporal lobe during natural story listening. Brain and Language.\n\nCer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.\n\nChen, Y., Gilroy, S., Knight, K., and May, J. (2017). Recurrent neural networks as weighted language recognizers. arXiv preprint arXiv:1711.05408.\n\nConneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. (2018). What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070.\n\nDai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.\n\nDevlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.\n\nFedorenko, E. and Thompson-Schill, S. L. (2014). Reworking the language network. Trends in Cognitive Sciences, 18(3), 120\u2013126.\n\nFrank, S. L., Otten, L. J., Galli, G., and Vigliocco, G. (2015). The ERP response to the amount of information conveyed by words in sentences. Brain and Language, 140, 1\u201311.\n\nFriederici, A. D. (2011). The brain basis of language processing: from structure to function. Physiological Reviews, 91(4), 1357\u20131392.\n\nFyshe, A., Talukdar, P. P., Murphy, B., and Mitchell, T. M. (2014). Interpretable semantic vectors from a joint model of brain- and text-based meaning. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2014, page 489. NIH Public Access.\n\nGao, J. S., Huth, A. 
G., Lescroart, M. D., and Gallant, J. L. (2015). Pycortex: an interactive surface visualizer for fMRI. Frontiers in Neuroinformatics, 9, 23.\n\nGardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer, L. S. (2017). AllenNLP: A deep semantic natural language processing platform.\n\nGoldberg, Y. (2019). Assessing BERT\u2019s syntactic abilities. arXiv preprint arXiv:1901.05287.\n\nHagoort, P. (2003). How the brain solves the binding problem for language: a neurocomputational model of syntactic processing. NeuroImage, 20, S18\u2013S29.\n\nHale, J., Dyer, C., Kuncoro, A., and Brennan, J. R. (2018). Finding syntax in human encephalography with beam search. arXiv preprint arXiv:1806.04127.\n\nHickok, G. and Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393\u2013402.\n\nHoward, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328\u2013339.\n\nHuth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., and Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453\u2013458.\n\nJain, S. and Huth, A. (2018). Incorporating context into language encoding models for fMRI. bioRxiv, page 327601.\n\nKiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294\u20133302.\n\nKubilius, J., Schrimpf, M., Hong, H., Majaj, N., Rajalingham, R., Issa, E., Kar, K., Bashivan, P., Prescott-Roy, J., Schmidt, K., et al. (2019). Brain-like object recognition with high-performing shallow recurrent ANNs. 
In Advances in Neural Information Processing Systems, pages 12785\u201312796.\n\nLerner, Y., Honey, C. J., Silbert, L. J., and Hasson, U. (2011). Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. The Journal of Neuroscience, 31(8), 2906\u20132915.\n\nLinzen, T., Dupoux, E., and Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. arXiv preprint arXiv:1611.01368.\n\nMarvin, R. and Linzen, T. (2018). Targeted syntactic evaluation of language models. arXiv preprint arXiv:1808.09031.\n\nMitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason, R. A., and Just, M. A. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320(5880), 1191\u20131195.\n\nNishimoto, S., Vu, A., Naselaris, T., Benjamini, Y., Yu, B., and Gallant, J. (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology.\n\nPeng, H., Schwartz, R., Thomson, S., and Smith, N. A. (2018). Rational recurrences. arXiv preprint arXiv:1808.09357.\n\nPeters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.\n\nPoldrack, R. A. (2006). Can cognitive processes be inferred from neuroimaging data? Trends in Cognitive Sciences, 10(2), 59\u201363.\n\nRowling, J. (2012). Harry Potter and the Sorcerer\u2019s Stone. Harry Potter US. Pottermore Limited.\n\nS\u00f8gaard, A. (2016). Evaluating word embeddings with fMRI and eye-tracking. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 116\u2013121.\n\nSpeer, N., Reynolds, J., Swallow, K., and Zacks, J. (2009). Reading stories activates neural representations of visual and motor experiences. 
Psychological Science, 20(8), 989\u2013999.\n\nSudre, G., Pomerleau, D., Palatucci, M., Wehbe, L., Fyshe, A., Salmelin, R., and Mitchell, T. (2012). Tracking neural coding of perceptual and semantic features of concrete nouns. NeuroImage, 62(1), 451\u2013463.\n\nVaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998\u20136008.\n\nWehbe, L., Vaswani, A., Knight, K., and Mitchell, T. M. (2014a). Aligning context-based statistical models of language with brain activity during reading. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 233\u2013243, Doha, Qatar. Association for Computational Linguistics.\n\nWehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A., and Mitchell, T. M. (2014b). Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLOS ONE, 9(11): e112575.\n\nWeiss, G., Goldberg, Y., and Yahav, E. (2018). On the practical computational power of finite precision RNNs for language recognition. arXiv preprint arXiv:1805.04908.\n\nZhu, X., Li, T., and de Melo, G. (2018). Exploring semantic properties of sentence embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 632\u2013637.", "award": [], "sourceid": 8531, "authors": [{"given_name": "Mariya", "family_name": "Toneva", "institution": "Carnegie Mellon University"}, {"given_name": "Leila", "family_name": "Wehbe", "institution": "Carnegie Mellon University"}]}