{"title": "Visualizing and Measuring the Geometry of BERT", "book": "Advances in Neural Information Processing Systems", "page_first": 8594, "page_last": 8603, "abstract": "Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.", "full_text": "9\n1\n0\n2\n\n \nt\nc\nO\n8\n2\n\n \n\n \n \n]\n\nG\nL\n.\ns\nc\n[\n \n \n\n2\nv\n5\n1\n7\n2\n0\n\n.\n\n6\n0\n9\n1\n:\nv\ni\nX\nr\na\n\nVisualizing and Measuring the Geometry of BERT\n\nAndy Coenen\u21e4, Emily Reif\u21e4, Ann Yuan\u21e4\n\nBeen Kim, Adam Pearce, Fernanda Vi\u00e9gas, Martin Wattenberg\n\n{andycoenen,ereif,annyuan,beenkim,adampearce,viegas,wattenberg}@google.com\n\nGoogle Brain\nCambridge, MA\n\nAbstract\n\nTransformer architectures show signi\ufb01cant promise for natural language processing.\nGiven that a single pretrained model can be \ufb01ne-tuned to perform well on many\ndifferent tasks, these networks appear to extract generally useful linguistic features.\nHow do such networks represent this information internally? This paper describes\nqualitative and quantitative investigations of one particularly effective model, BERT.\nAt a high level, linguistic features seem to be represented in separate semantic and\nsyntactic subspaces. 
We \ufb01nd evidence of a \ufb01ne-grained geometric representation of\nword senses. We also present empirical descriptions of syntactic representations in\nboth attention matrices and individual word embeddings, as well as a mathematical\nargument to explain the geometry of these representations.\n\n1\n\nIntroduction\n\nNeural networks for language processing have advanced rapidly in recent years. A key breakthrough\nwas the introduction of transformer architectures [25]. One recent system based on this idea, BERT\n[5], has proven to be extremely \ufb02exible: a single pretrained model can be \ufb01ne-tuned to achieve\nstate-of-the-art performance on a wide variety of NLP applications. This suggests the model is\nextracting a set of generally useful features from raw text. It is natural to ask: which features are\nextracted? And how is this information represented internally?\nSimilar questions have arisen with other types of neural nets. Investigations of convolutional neural\nnetworks [9, 8] have shown how representations change from layer to layer [27]; how individual\nunits in a network may have meaning [2]; and that \u201cmeaningful\u201d directions exist in the space of\ninternal activations [7]. These explorations have led to a broader understanding of network behavior.\nAnalyses on language-processing models (e.g., [1, 6, 10, 20, 24]) point to the existence of similarly\nrich internal representations of linguistic structure. Syntactic features seem to be extracted by RNNs\n(e.g., [1, 10]) as well as by BERT [24, 23, 11, 20]. Inspirational work from Hewitt and Manning [6]\nfound evidence of a geometric representation of entire parse trees in BERT\u2019s activation space.\nOur work extends these explorations of the geometry of internal representations. Investigating\nhow BERT represents syntax, we describe evidence that attention matrices contain grammatical\nrepresentations. 
We also provide mathematical arguments that may explain the particular form of the\nparse tree embeddings described in [6]. Turning to semantics, using visualizations of the activations\ncreated by different pieces of text, we show suggestive evidence that BERT distinguishes word senses\nat a very \ufb01ne level. Moreover, much of this semantic information appears to be encoded in a relatively\nlow-dimensional subspace.\n\n*Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f2 Context and related work\n\nOur object of study is the BERT model introduced in [5]. To set context and terminology, we brie\ufb02y\ndescribe the model\u2019s architecture. The input to BERT is based on a sequence of tokens (words or\npieces of words). The output is a sequence of vectors, one for each input token. We will often refer to\nthese vectors as context embeddings because they include information about a token\u2019s context.\nBERT\u2019s internals consist of two parts. First, an initial embedding for each token is created by\ncombining a pre-trained wordpiece embedding with position and segment information. Next, this\ninitial sequence of embeddings is run through multiple transformer layers, producing a new sequence\nof context embeddings at each step. (BERT comes in two versions, a 12-layer BERT-base model and\na 24-layer BERT-large model.) Implicit in each transformer layer is a set of attention matrices, one\nfor each attention head, each of which contains a scalar value for each ordered pair (token_i, token_j).\n\n2.1 Language representation by neural networks\n\nSentences are sequences of discrete symbols, yet neural networks operate on continuous data\u2013vectors\nin high-dimensional space. Clearly a successful network translates discrete input into some kind of\ngeometric representation\u2013but in what form? 
And which linguistic features are represented?\nThe in\ufb02uential Word2Vec system [16], for example, has been shown to place related words near each\nother in space, with certain directions in space corresponding to semantic distinctions. Grammatical\ninformation such as number and tense is also represented via directions in space. Analyses of the\ninternal states of RNN-based models have shown that they represent information about soft hierarchical syntax in a form that can be extracted by a one-hidden-layer network [10]. One investigation of\nfull-sentence embeddings found that a wide variety of syntactic properties could be extracted not just by\nan MLP, but by logistic regression [3].\nSeveral investigations have focused on transformer architectures. Experiments suggest context\nembeddings in BERT and related models contain enough information to perform many tasks in the\ntraditional \u201cNLP pipeline\u201d [23]\u2013part-of-speech tagging, co-reference resolution, dependency labeling,\netc.\u2013with simple classi\ufb01ers (linear or small MLP models) [24, 20]. Qualitative, visualization-based\nwork [26] suggests attention matrices may encode important relations between words.\nA recent and fascinating discovery by Hewitt and Manning [6], which motivates much of our work, is\nthat BERT seems to create a direct representation of an entire dependency parse tree. The authors \ufb01nd\nthat (after a single global linear transformation, which they term a \u201cstructural probe\u201d) the square of\nthe distance between context embeddings is roughly proportional to tree distance in the dependency\nparse. They ask why squaring distance is necessary; we address this question in the next section.\nThe work cited above suggests that language-processing networks create a rich set of intermediate\nrepresentations of both semantic and syntactic information. These results lead to two motivating\nquestions for our research. 
Can we \ufb01nd other examples of intermediate representations? And, from a\ngeometric perspective, how do all these different types of information coexist in a single vector?\n\n3 Geometry of syntax\n\nWe begin by exploring BERT\u2019s internal representation of syntactic information. This line of inquiry\nbuilds on the work by Hewitt and Manning in two ways. First, we look beyond context embeddings\nto investigate whether attention matrices encode syntactic features. Second, we provide a simple\nmathematical analysis of the tree embeddings that they found.\n\n3.1 Attention probes and dependency representations\n\nAs in [6], we are interested in \ufb01nding representations of dependency grammar relations [4]. While [6]\nanalyzed context embeddings, another natural place to look for encodings is in the attention matrices.\nAfter all, attention matrices are explicitly built on the relations between pairs of words.\nTo formalize what it means for attention matrices to encode linguistic features, we use an attention probe, an analog of edge probing [24]. An attention probe is a task for a pair of tokens,\n(token_i, token_j), where the input is a model-wide attention vector formed by concatenating the\n\n2\n\n\fFigure 1: A model-wide attention vector for an ordered pair of tokens contains the scalar attention\nvalues for that pair in all attention heads and layers. Shown: BERT-base.\n\nentries a_ij in every attention matrix from every attention head in every layer. The goal is to classify\na given relation between the two tokens. If a linear model achieves reliable accuracy, it seems\nreasonable to say that the model-wide attention vector encodes that relation. 
We apply attention\nprobes to the task of identifying the existence and type of dependency relation between two words.\n\n3.1.1 Method\nThe data for our \ufb01rst experiment is a corpus of parsed sentences from the Penn Treebank [13].\nThis dataset has the constituency grammar for the sentences, which was translated to a dependency\ngrammar using the PyStanfordDependencies library [14]. The entirety of the Penn Treebank consists\nof 3.1 million dependency relations; we \ufb01ltered this by using only examples of the 30 dependency\nrelations with more than 5,000 examples in the data set. We then ran each sentence through BERT-base, and obtained the model-wide attention vector (see Figure 1) between every pair of tokens in the\nsentence, excluding the [SEP] and [CLS] tokens. This and subsequent experiments were conducted\nusing PyTorch on MacBook machines.\nWith these labeled embeddings, we trained two L2-regularized linear classi\ufb01ers via stochastic gradient\ndescent, using scikit-learn [19]. The \ufb01rst of these probes was a simple linear binary classi\ufb01er to predict whether\nor not an attention vector corresponds to the existence of a dependency relation between two tokens.\nThis was trained with a balanced class split and a 30% train/test split. The second probe was a\nmulticlass classi\ufb01er to predict which type of dependency relation exists between two tokens, given\nthe dependency relation\u2019s existence. This probe was trained with the distributions outlined in Table 2.\n\n3.1.2 Results\nThe binary probe achieved an accuracy of 85.8%, and the multiclass probe achieved an accuracy of\n71.9%. Our real aim, again, is not to create a state-of-the-art parser, but to gauge whether model-wide\nattention vectors contain a relatively simple representation of syntactic features. 
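As a concrete illustration, the probe pipeline can be sketched as follows. This is a minimal mock-up, not the paper's released code: the attention tensors are synthetic stand-ins for BERT-base's real ones, and all function and variable names are our own.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

N_LAYERS, N_HEADS = 12, 12  # BERT-base: 144 attention heads in total

def model_wide_attention_vector(attn, i, j):
    """Concatenate the (i, j) entry of every attention matrix, across all
    heads and layers, into one 144-dimensional vector (cf. Figure 1)."""
    # attn has shape (n_layers, n_heads, seq_len, seq_len)
    return attn[:, :, i, j].reshape(-1)

rng = np.random.default_rng(0)
seq_len = 10

def synthetic_example(has_dependency):
    """Fake attention tensor: pairs joined by a dependency edge get extra
    attention mass in a few heads, mimicking a learnable signal."""
    attn = rng.random((N_LAYERS, N_HEADS, seq_len, seq_len))
    if has_dependency:
        attn[:3, :2, 2, 5] += 1.0
    return model_wide_attention_vector(attn, 2, 5)

labels = np.array([k % 2 for k in range(400)])
X = np.stack([synthetic_example(bool(y)) for y in labels])

# Binary attention probe: an L2-regularized linear classifier trained with
# stochastic gradient descent, as in Section 3.1.1.
probe = SGDClassifier(penalty="l2", random_state=0).fit(X[:300], labels[:300])
accuracy = probe.score(X[300:], labels[300:])
print(X.shape[1], round(accuracy, 2))
```

On this easily separable synthetic data the probe recovers the planted signal; the paper's experiments apply the same recipe to attention vectors extracted from real sentences.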
The success of this\nsimple linear probe suggests that syntactic information is in fact encoded in the attention vectors.\n\n3.2 Geometry of parse tree embeddings\n\nHewitt and Manning\u2019s result that context embeddings represent dependency parse trees geometrically\nraises several questions. Is there a reason for the particular mathematical representation they found?\nCan we learn anything by visualizing these representations?\n\n3.2.1 Mathematics of embedding trees in Euclidean space\nHewitt and Manning ask why parse tree distance seems to correspond speci\ufb01cally to the square of\nEuclidean distance, and whether some other metric might do better [6]. We describe mathematical\nreasons why squared Euclidean distance may be natural.\n\n3\n\n\fFirst, one cannot generally embed a tree, with its tree metric d, isometrically into Euclidean space\n(Appendix 6.1). Since an isometric embedding is impossible, motivated by the results of [6] we might\nask about other possible representations.\nDe\ufb01nition 1 (power-p embedding). Let M be a metric space, with metric d. We say f : M \u2192 R^n is\na power-p embedding if for all x, y \u2208 M, we have\n\n||f(x) \u2212 f(y)||^p = d(x, y)\n\nWe will refer to the special case of a power-2 embedding as a Pythagorean embedding.\n\nIn these terms, we can say [6] found evidence of a Pythagorean embedding for parse trees. It turns\nout that Pythagorean embeddings of trees are especially simple. For one thing, it is easy to write\ndown an explicit model\u2013a mathematical idealization\u2013for a Pythagorean embedding for any tree.\nTheorem 1. Any tree with n nodes has a Pythagorean embedding into R^{n\u22121}.\nProof. Let the nodes of the tree be t_0, ..., t_{n\u22121}, with t_0 being the root node. Let {e_1, ..., e_{n\u22121}} be\northogonal unit basis vectors for R^{n\u22121}. 
Inductively, de\ufb01ne an embedding f such that:\n\nf(t_0) = 0\n\nf(t_i) = e_i + f(parent(t_i))\n\nGiven two distinct tree nodes x and y, where m is the tree distance d(x, y), it follows that we can\nmove from f(x) to f(y) using m mutually perpendicular unit steps. Thus\n\n||f(x) \u2212 f(y)||^2 = m = d(x, y)\n\nRemark 1. This embedding has a simple informal description: at each embedded vertex of the graph,\nall line segments to neighboring embedded vertices are unit-distance segments, orthogonal to each\nother and to every other edge segment. (It\u2019s even easy to write down a set of coordinates for each\nnode.) By de\ufb01nition any two Pythagorean embeddings of the same tree are isometric; with that in\nmind, we refer to this as the canonical Pythagorean embedding. (See [12] for an independent version\nof this theorem.)\nIn the proof of Theorem 1, instead of choosing basis vectors in advance, one can choose random\nunit vectors. Because two random vectors will be nearly orthogonal in high-dimensional space,\nthe Pythagorean embedding condition will approximately hold. This means that in space that\nis suf\ufb01ciently high-dimensional (compared to the size of the tree) it is possible to construct an\napproximate Pythagorean embedding with essentially \u201clocal\u201d information, where a tree node is\nconnected to its children via random unit-length branches. We refer to this type of embedding as a\nrandom branch embedding. (See Appendix 6.2 for visualizations, and Appendix 6.1 for mathematical\ndetail.)\nIt is also worth noting that power-p embeddings will not necessarily even exist when p < 2. (See\nAppendix 6.1.)\nTheorem 2. For any p < 2, there is a tree which has no power-p embedding.\nRemark 2. A result of Schoenberg [22], phrased in our terminology, is that if a metric space X has a\npower-p embedding into R^n, then it also has a power-q embedding for any q > p. Thus for p > 2\nthere will always be a power-p embedding for any tree. 
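Returning to Theorem 1: both the canonical construction and its random-branch approximation are easy to check numerically. The sketch below uses a small hand-made tree (the tree and all names are illustrative, not from any released code); the canonical embedding satisfies the Pythagorean condition exactly, while random unit branches in a high-dimensional space satisfy it only approximately.

```python
import numpy as np
from itertools import combinations

# A toy tree as parent pointers; node 0 is the root t_0.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
n = len(parent) + 1

def tree_distance(a, b):
    """Number of edges on the path between nodes a and b."""
    depth_of = {a: 0}
    x, d = a, 0
    while x != 0:
        x = parent[x]
        d += 1
        depth_of[x] = d
    x, d = b, 0
    while x not in depth_of:
        x = parent[x]
        d += 1
    return d + depth_of[x]

# Canonical Pythagorean embedding: f(t_0) = 0, f(t_i) = e_i + f(parent(t_i)).
f = np.zeros((n, n - 1))
for i in range(1, n):
    f[i] = f[parent[i]]
    f[i, i - 1] = 1.0  # step along the orthogonal unit basis vector e_i

for a, b in combinations(range(n), 2):
    # squared Euclidean distance equals tree distance, exactly
    assert np.isclose(np.sum((f[a] - f[b]) ** 2), tree_distance(a, b))

# Random branch embedding: replace the e_i by random unit vectors in a
# high-dimensional space; near-orthogonality makes the condition approximate.
rng = np.random.default_rng(0)
dim = 10_000
steps = rng.normal(size=(n, dim))
steps /= np.linalg.norm(steps, axis=1, keepdims=True)
g = np.zeros((n, dim))
for i in range(1, n):
    g[i] = g[parent[i]] + steps[i]

worst = max(abs(np.sum((g[a] - g[b]) ** 2) - tree_distance(a, b))
            for a, b in combinations(range(n), 2))
print(round(worst, 3))  # deviation from a true power-2 embedding
```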
Unlike the case of p = 2, we do not know of\na simple way to describe the geometry of such an embedding.\nThe simplicity of Pythagorean tree embeddings, as well as the fact that they may be approximated by\na simple random model, suggests they may be a generally useful alternative to approaches to tree\nembeddings that require hyperbolic geometry [18].\n\n3.2.2 Visualization of parse tree embeddings\nHow do parse tree embeddings in BERT compare to exact power-2 embeddings? To explore this\nquestion, we created a simple visualization tool. The input to each visualization is a sentence from\nthe Penn Treebank with its associated dependency parse tree (see Section 3.1.1). We then extracted the\n\n4\n\n\fFigure 2: Visualizing embeddings of two sentences after applying the Hewitt-Manning probe. We\ncompare the parse tree (left images) with a PCA projection of context embeddings (right images).\n\nFigure 3: The average squared edge length between two words with a given dependency.\n\ntoken embeddings produced by BERT-large in layer 16 (following [6]), transformed by Hewitt\nand Manning\u2019s \u201cstructural probe\u201d matrix B, yielding a set of points in 1024-dimensional space. We\nused PCA to project to two dimensions. (Other dimensionality-reduction methods, such as t-SNE\nand UMAP [15], were harder to interpret.)\nTo visualize the tree structure, we connected pairs of points representing words with a dependency\nrelation. The color of each edge indicates the deviation from true tree distance. We also connected,\nwith a dotted line, pairs of words without a dependency relation but whose positions (before PCA) were\nfar closer than expected. The resulting image lets us see both the overall shape of the tree embedding,\nand \ufb01ne-grained information on deviation from a true power-2 embedding.\nTwo example visualizations are shown in Figure 2, next to traditional diagrams of their underlying\nparse trees. 
These are typical cases, illustrating some common patterns; for instance, prepositions are\nembedded unexpectedly close to words they relate to. (Figure 8 shows additional examples.)\nA natural question is whether the difference between these projected trees and the canonical ones is\nmerely noise, or a more interesting pattern. By looking at the average embedding distances of each\ndependency relation (see Figure 3), we can see that they vary widely from around 1.2 (compound:prt, advcl) to 2.5 (mwe, parataxis, auxpass). Such systematic differences suggest that BERT\u2019s\nsyntactic representation has an additional quantitative aspect beyond traditional dependency grammar.\n\n4 Geometry of word senses\n\nBERT seems to have several ways of representing syntactic information. What about semantic\nfeatures? Since embeddings produced by transformer models depend on context, it is natural to\nspeculate that they capture the particular shade of meaning of a word as used in a particular sentence.\n(E.g., is \u201cbark\u201d an animal noise or part of a tree?) We explored geometric representations of word\nsense both qualitatively and quantitatively.\n\n5\n\n\fFigure 4: Embeddings for the word \"die\" in different contexts, visualized with UMAP. Sample points\nare annotated with corresponding sentences. Overall annotations (blue text) are added as a guide.\n\n4.1 Visualization of word senses\n\nOur \ufb01rst experiment is an exploratory visualization of how word sense affects context embeddings.\nFor data on different word senses, we collected all sentences used in the introductions to English-\nlanguage Wikipedia articles. (Text outside of introductions was frequently fragmentary.) We created\nan interactive application, which we plan to make public. A user enters a word, and the system\nretrieves 1,000 sentences containing that word. 
It sends these sentences to BERT-base as input, and\nfor each one it retrieves the context embedding for the word from a layer of the user\u2019s choosing.\nThe system visualizes these 1,000 context embeddings using UMAP [15], generally showing clear\nclusters relating to word senses. Different senses of a word are typically spatially separated, and\nwithin the clusters there is often further structure related to \ufb01ne shades of meaning. In Figure 4, for\nexample, we not only see crisp, well-separated clusters for three meanings of the word \u201cdie,\u201d but\nwithin one of these clusters there is a kind of quantitative scale, related to the number of people\ndying. See Appendix 6.4 for further examples. The apparent detail in the clusters we visualized raises\ntwo immediate questions. First, is it possible to \ufb01nd quantitative corroboration that word senses are\nwell-represented? Second, how can we resolve a seeming contradiction: in the previous section, we\nsaw how position represented syntax; yet here we see position representing semantics.\n\n4.2 Measurement of word sense disambiguation capability\n\nThe crisp clusters seen in visualizations such as Figure 4 suggest that BERT may create simple,\neffective internal representations of word senses, putting different meanings in different locations. To\ntest this hypothesis quantitatively, we test whether a simple classi\ufb01er on these internal representations\ncan perform well at word-sense disambiguation (WSD).\nWe follow the procedure described in [20], which performed a similar experiment with the ELMo\nmodel. For a given word with n senses, we make a nearest-neighbor classi\ufb01er where each neighbor is\nthe centroid of a given word sense\u2019s BERT-base embeddings in the training data. To classify a new\nword we \ufb01nd the closest of these centroids, defaulting to the most commonly used sense if the word\nwas not present in the training data. 
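A skeletal version of this nearest-centroid classifier is sketched below. It runs on synthetic stand-in embeddings (one Gaussian cluster per word sense) rather than real BERT outputs, and all names are ours; the paper's experiment applies the same logic to SemCor-labeled embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # BERT-base context-embedding dimension

# Synthetic stand-in for training data: context embeddings grouped by
# (word, sense); each sense forms a Gaussian cluster in embedding space.
means = {("bank", 0): rng.normal(size=DIM),
         ("bank", 1): rng.normal(size=DIM),
         ("die", 0): rng.normal(size=DIM)}
train = {ws: mu + 0.3 * rng.normal(size=(20, DIM)) for ws, mu in means.items()}

# One centroid per word sense, computed from the training embeddings.
centroids = {ws: embs.mean(axis=0) for ws, embs in train.items()}

def disambiguate(word, embedding, most_frequent_sense=0):
    """Nearest-centroid word-sense classification, with a most-frequent-sense
    fallback for words unseen in training."""
    candidates = [(sense, c) for (w, sense), c in centroids.items() if w == word]
    if not candidates:
        return most_frequent_sense
    return min(candidates,
               key=lambda sc: np.linalg.norm(embedding - sc[1]))[0]

query = means[("bank", 1)] + 0.3 * rng.normal(size=DIM)
predicted = disambiguate("bank", query)
fallback = disambiguate("quokka", query)  # unseen word: back off
print(predicted, fallback)
```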
We used the data and evaluation from [21]: the training data was\nSemCor [17] (33,362 senses), and the testing data was the suite described in [21] (3,669 senses).\nThe simple nearest-neighbor classi\ufb01er achieves an F1 score of 71.1, higher than the current state of\nthe art (Table 1), with the accuracy monotonically increasing through the layers. This is a strong\nsignal that context embeddings are representing word-sense information. Additionally, an even higher\nscore of 71.5 was obtained using the technique described in the following section.\n\n6\n\n\fMethod                          F1 score\nBaseline (most frequent sense)  64.8\nELMo [20]                       70.1\nBERT                            71.1\nBERT (w/ probe)                 71.5\n\nm           Trained probe  Random probe\n768 (full)  71.26          70.74\n512         71.52          70.51\n256         71.29          69.92\n128         71.21          69.56\n64          70.19          68.00\n32          68.01          64.62\n16          65.34          61.01\n\nTable 1: [Left] F1 scores for the WSD task. [Right] Semantic probe % accuracy on \ufb01nal-layer BERT-base embeddings.\n\n4.2.1 An embedding subspace for word senses?\nWe hypothesized that there might also exist a linear transformation under which distances between\nembeddings would better re\ufb02ect their semantic relationships\u2013that is, words of the same sense would\nbe closer together and words of different senses would be further apart.\nTo explore this hypothesis, we trained a probe following Hewitt and Manning\u2019s methodology. We\ninitialized a random matrix B \u2208 R^{k\u00d7m}, testing different values for m. Loss is, roughly, de\ufb01ned as\nthe difference between the average cosine similarity between embeddings of words with different\nsenses, and that between embeddings of the same sense. However, we clamped the cosine similarity\nterms to within \u00b10.1 of the pre-training averages for same and different senses. (Without clamping,\nthe trained matrix simply ended up taking well-separated clusters and separating them further. 
We\ntested values between 0.05 and 0.2 for the clamping range, and 0.1 had the best performance.)\nOur training corpus was the same dataset as in 4.2, \ufb01ltered to include only words with at least two\nsenses, each with at least two occurrences (covering 8,542 of the original 33,362 senses). Embeddings\ncame from BERT-base (12 layers, 768-dimensional embeddings). We evaluate our trained probes on\nthe same dataset and WSD task used in 4.2 (Table 1). As a control, we compare each trained probe\nagainst a random probe of the same shape. As mentioned in 4.2, untransformed BERT embeddings\nachieve a state-of-the-art F1 score of 71.1. We \ufb01nd that our trained probes are able to achieve\nslightly improved accuracy down to m = 128.\nThough our probe achieves only a modest improvement in accuracy for \ufb01nal-layer embeddings,\nwe note that we were able to more dramatically improve the performance of embeddings at earlier\nlayers (see Appendix for details: Figure 11). This suggests there is more semantic information in the\ngeometry of earlier-layer embeddings than a \ufb01rst glance might reveal.\nOur results also support the idea that word sense information may be contained in a lower-dimensional\nspace. This suggests a resolution to the seeming contradiction mentioned above: a vector encodes\nboth syntax and semantics, but in separate complementary subspaces (see Appendix 6.7 for details).\n\n4.3 Embedding distance and context: a concatenation experiment\nIf word sense is affected by context, and encoded by location in space, then we should be able to\nin\ufb02uence context embedding positions by systematically varying their context. To test this hypothesis,\nwe performed an experiment based on a simple and controllable context change: concatenating\nsentences where the same word is used in different senses.\n\n4.3.1 Method\nWe picked 25,096 sentence pairs from SemCor, using the same keyword in different senses. 
E.g.:\n\nA: \"He thereupon went to London and spent the winter talking to men of wealth.\"\nwent: to move from one place to another.\nB: \"He went prone on his stomach, the better to pursue his examination.\" went: to\nenter into a speci\ufb01ed state.\n\nWe de\ufb01ne a matching and an opposing sense centroid for each keyword. For sentence A, the matching\nsense centroid is the average embedding for all occurrences of \u201cwent\u201d used with sense A. A\u2019s opposing\nsense centroid is the average embedding for all occurrences of \u201cwent\u201d used with sense B.\n\n7\n\n\fWe gave each individual sentence in the pair to BERT-base and recorded the cosine similarity between\nthe keyword embeddings and their matching sense centroids. We also recorded the similarity between\nthe keyword embeddings and their opposing sense centroids. We call the ratio between the two\nsimilarities the individual similarity ratio. Generally this ratio is greater than one, meaning that the\ncontext embedding for the keyword is closer to the matching centroid than the opposing one.\nWe joined each sentence pair with the word \"and\" to create a single new sentence. We gave these\nconcatenations to BERT and recorded the similarities between the keyword embeddings and their\nmatching/opposing sense centroids. Their ratio is the concatenated similarity ratio.\n\n4.3.2 Results\n\nOur hypothesis was that the keyword embed-\ndings in the concatenated sentence would move\ntowards their opposing sense centroids. Indeed,\nwe found that the average individual similar-\nity ratio was higher than the average concate-\nnated similarity ratio at every layer (see Fig-\nure 5). Concatenating a random sentence did not\nchange the individual similarity ratios. If the ra-\ntio is less than one for any sentence, that means\nBERT has misclassi\ufb01ed its keyword sense. 
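The two ratios can be computed directly from keyword embeddings and sense centroids. Here is a minimal sketch with made-up vectors standing in for real embeddings (all names and the mixing coefficients are ours, chosen only to illustrate why concatenation can lower the ratio):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_ratio(keyword_emb, matching_centroid, opposing_centroid):
    """Ratio of the keyword embedding's cosine similarity to its matching
    sense centroid over its similarity to the opposing sense centroid.
    A value above 1 means the embedding sits closer to its own sense."""
    return (cosine(keyword_emb, matching_centroid)
            / cosine(keyword_emb, opposing_centroid))

rng = np.random.default_rng(0)
dim = 768
base = rng.random(dim)             # shared component keeps cosines positive
matching = base + rng.random(dim)  # centroid of sense-A occurrences
opposing = base + rng.random(dim)  # centroid of sense-B occurrences

# Keyword in its own sentence vs. pulled toward the other sense by context.
alone = matching + 0.1 * rng.normal(size=dim)
concatenated = 0.7 * matching + 0.3 * opposing

r_individual = similarity_ratio(alone, matching, opposing)
r_concatenated = similarity_ratio(concatenated, matching, opposing)
print(round(r_individual, 3), round(r_concatenated, 3))
```

In this toy setup, as in the measurements below, the concatenated ratio stays above one but falls below the individual ratio.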
We found that the misclassi\ufb01cation rate was signi\ufb01cantly higher for \ufb01nal-layer embeddings in the\nconcatenated sentences compared to the individual sentences: 8.23% versus 2.43%, respectively.\nWe also measured the effect of projecting \ufb01nal-layer keyword embeddings into the semantic subspace\nfrom 4.2.1. After multiplying each embedding by our trained semantic probe, we obtained an average\nconcatenated similarity ratio of 1.58 and an individual similarity ratio of 1.88, suggesting the transformed\nembeddings are closer to their matching sense centroids than the original embeddings (the original\nconcatenated similarity ratio is 1.28 and the individual similarity ratio is 1.43). We also measured\nlower average misclassi\ufb01cation rates for transformed embeddings: 7.31% for concatenated sentences\nand 2.27% for individual sentences.\nOur results show how a token\u2019s embedding in a sentence may systematically differ from the embedding\nfor the same token in the same sentence concatenated with a non-sequitur. This points to a potential\nfailure mode for attention-based models: tokens do not necessarily respect semantic boundaries when\nattending to neighboring tokens, but rather indiscriminately absorb meaning from all neighbors.\n\nFigure 5: Average similarity ratio: senses A vs. B.\n\n5 Conclusion and future work\n\nWe have presented a series of experiments that shed light on BERT\u2019s internal representations of\nlinguistic information. We have found evidence of syntactic representation in attention matrices, with\ncertain directions in space representing particular dependency relations. We have also provided a\nmathematical justi\ufb01cation for the squared-distance tree embedding found by Hewitt and Manning.\nMeanwhile, we have shown that just as there are speci\ufb01c syntactic subspaces, there is evidence for\nsubspaces that represent semantic information. 
We have also shown how mistakes in word sense\ndisambiguation may correspond to changes in the internal geometric representation of word meaning.\nOur experiments also suggest an answer to the question of how all these different representations\n\ufb01t together. We conjecture that the internal geometry of BERT may be broken into multiple linear\nsubspaces, with separate spaces for different syntactic and semantic information.\nInvestigating this kind of decomposition is a natural direction for future research. What other\nmeaningful subspaces exist? After all, there are many types of linguistic information that we have\nnot looked for. A second important avenue of exploration is what the internal geometry can tell us\nabout the speci\ufb01cs of the transformer architecture. Can an understanding of the geometry of internal\nrepresentations help us \ufb01nd areas for improvement, or re\ufb01ne BERT\u2019s architecture?\nAcknowledgments: We would like to thank David Belanger, Tolga Bolukbasi, Jasper Snoek, and\nIan Tenney for helpful feedback and discussions.\n\n8\n\n\fReferences\n[1] Terra Blevins, Omer Levy, and Luke Zettlemoyer. Deep RNNs encode soft hierarchical syntax.\narXiv preprint arXiv:1805.04218, 2018.\n\n[2] Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. Activation atlas.\nDistill, 2019. https://distill.pub/2019/activation-atlas.\n\n[3] Alexis Conneau, German Kruszewski, Guillaume Lample, Lo\u00efc Barrault, and Marco Baroni.\nWhat you can cram into a single vector: Probing sentence embeddings for linguistic properties.\narXiv preprint arXiv:1805.01070, 2018.\n\n[4] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. Generating\ntyped dependency parses from phrase structure parses. In LREC, volume 6, pages 449\u2013454, 2006.\n\n[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of\ndeep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805,\n2018.\n\n[6] John Hewitt and Christopher D Manning. A structural probe for \ufb01nding syntax in word\nrepresentations. Association for Computational Linguistics, 2019.\n\n[7] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas,\nand Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept\nactivation vectors (TCAV). arXiv preprint arXiv:1711.11279, 2017.\n\n[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classi\ufb01cation with deep\nconvolutional neural networks. In Advances in neural information processing systems, pages\n1097\u20131105, 2012.\n\n[9] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series.\nThe handbook of brain theory and neural networks, 3361(10):1995, 1995.\n\n[10] Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn\nsyntax-sensitive dependencies. Transactions of the Association for Computational Linguistics,\n4:521\u2013535, 2016.\n\n[11] Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A Smith. Linguistic\nknowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855,\n2019.\n\n[12] Hiroshi Maehara. Euclidean embeddings of \ufb01nite metric spaces. Discrete Mathematics,\n313(23):2848\u20132856, 2013.\n\n[13] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large\nannotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313\u2013330, June 1993.\n\n[14] David McClosky. PyStanfordDependencies. https://github.com/dmcc/PyStanfordDependencies, 2015.\n\n[15] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation\nand projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.\n\n[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 
Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[17] George A. Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. A semantic concordance. In HLT, 1993.

[18] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.

[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[20] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[21] Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110, Valencia, Spain, April 2017. Association for Computational Linguistics.

[22] Isaac J. Schoenberg. On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in Hilbert space. Annals of Mathematics, pages 787–793, 1937.

[23] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950, 2019.

[24] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. What do you learn from context?
Probing for sentence structure in contextualized word representations. 2018.

[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[26] Jesse Vig. Visualizing attention in transformer-based language models. arXiv preprint arXiv:1904.02679, 2019.

[27] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[28] Yury. Minimizing the maximum dot product among k unit vectors in an n-dimensional space. https://cstheory.stackexchange.com/q/14748, 2012. Accessed: 2019-05-09.

6 Appendix

6.1 Embedding trees in Euclidean space

Here we provide additional detail on the existence of various forms of tree embeddings.

Isometric embeddings of a tree (with its intrinsic tree metric) into Euclidean space are rare. Indeed, such an embedding is impossible even for a four-point tree T, consisting of a root node R with three children C1, C2, C3. If f : T → R^n is a tree isometry, then ||f(R) − f(C1)|| = ||f(R) − f(C2)|| = 1 and ||f(C1) − f(C2)|| = 2. It follows that f(R), f(C1), f(C2) are collinear, with f(R) the midpoint of f(C1) and f(C2). The same can be said of f(R), f(C1), and f(C3), meaning that ||f(C2) − f(C3)|| = 0 ≠ d(C2, C3). Since this four-point tree cannot be embedded, it follows that the only trees that can be embedded isometrically are simple chains.

Not only are isometric embeddings generally impossible, but power-p embeddings may also be unavailable when p < 2, as the following argument shows. See [12] for an independent alternative version.

Proof of Theorem 2

Proof. We covered the case of p = 1 above. When p < 1, even a tree of three points is impossible to embed without violating the triangle inequality.
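(As an aside, the four-point obstruction above is easy to confirm numerically: minimizing the total squared mismatch between embedded distances and the tree metric never reaches zero. The following sketch, which is illustrative and not part of the original analysis, assumes NumPy and SciPy; the stress function and restart count are arbitrary choices. Since four points always span at most a 3-dimensional affine subspace, optimizing in R^3 loses no generality.)

```python
import numpy as np
from scipy.optimize import minimize

# Tree metric for the four-point star: root R plus children C1, C2, C3,
# with unit-length edges, so d(R, Ci) = 1 and d(Ci, Cj) = 2.
D = np.array([[0., 1., 1., 1.],
              [1., 0., 2., 2.],
              [1., 2., 0., 2.],
              [1., 2., 2., 0.]])

def stress(flat):
    """Total squared mismatch between embedded and tree distances in R^3."""
    X = flat.reshape(4, 3)
    return sum((np.linalg.norm(X[i] - X[j]) - D[i, j]) ** 2
               for i in range(4) for j in range(i + 1, 4))

# Minimize the stress from several random starts and keep the best value.
rng = np.random.default_rng(0)
best = min(minimize(stress, rng.normal(size=12), method="BFGS").fun
           for _ in range(20))
print(f"minimum residual stress over restarts: {best:.4f}")
assert best > 1e-3  # bounded away from zero: no isometric embedding exists
```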
To handle the case when 1 < p < 2, consider a "star-shaped" tree of one root node with k children; without loss of generality, assume the root node