{"title": "The Global Anchor Method for Quantifying Linguistic Shifts and Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 9412, "page_last": 9423, "abstract": "Language is dynamic, constantly evolving and adapting with respect to time, domain or topic. The adaptability of language is an active research area, where researchers discover social, cultural and domain-specific changes in language using distributional tools such as word embeddings. In this paper, we introduce the global anchor method for detecting corpus-level language shifts. We show both theoretically and empirically that the global anchor method is equivalent to the alignment method, a widely-used method for comparing word embeddings, in terms of detecting corpus-level language shifts. Despite their equivalence in terms of detection abilities, we demonstrate that the global anchor method is superior in terms of applicability as it can compare embeddings of different dimensionalities. Furthermore, the global anchor method has implementation and parallelization advantages. We show that the global anchor method reveals fine structures in the evolution of language and domain adaptation. When combined with the graph Laplacian technique, the global anchor method recovers the evolution trajectory and domain clustering of disparate text corpora.", "full_text": "The Global Anchor Method for Quantifying\nLinguistic Shifts and Domain Adaptation\n\nDepartment of Electrical Engineering\n\nDepartment of Electrical Engineering\n\nVin Sachidananda\n\nStanford University\n\nvsachi@stanford.edu\n\nZi Yin\n\nStanford University\n\ns09600974@gmail.com\n\nBalaji Prabhakar\n\nDepartment of Electrical Engineering and\n\nDepartment of Computer Science\n\nStanford University\n\nbalaji@stanford.edu\n\nAbstract\n\nLanguage is dynamic, constantly evolving and adapting with respect to time,\ndomain or topic. 
The adaptability of language is an active research area, where\nresearchers discover social, cultural and domain-speci\ufb01c changes in language using\ndistributional tools such as word embeddings. In this paper, we introduce the\nglobal anchor method for detecting corpus-level language shifts. We show both\ntheoretically and empirically that the global anchor method is equivalent to the\nalignment method, a widely-used method for comparing word embeddings, in\nterms of detecting corpus-level language shifts. Despite their equivalence in terms\nof detection abilities, we demonstrate that the global anchor method is superior in\nterms of applicability as it can compare embeddings of different dimensionalities.\nFurthermore, the global anchor method has implementation and parallelization\nadvantages. We show that the global anchor method reveals \ufb01ne structures in the\nevolution of language and domain adaptation. When combined with the graph\nLaplacian technique, the global anchor method recovers the evolution trajectory\nand domain clustering of disparate text corpora.\n\n1\n\nIntroduction\n\nLinguistic variations are commonly observed among text corpora from different communities or time\nperiods [9, 11]. Domain adaptation seeks to quantify the degree to which language varies in distinct\ncorpora, such as text from different time periods or academic communities such as computer science\nand physics. This adaptation can be performed either at a word-level\u2013to determine if a particular\nword\u2019s semantics are different in the two corpora, or at the corpus-level\u2013to determine the similarity\nof language usage in the two corpora. Applications of these methods include identifying how words\nor phrases differ in meaning in different corpora or how well text-based models trained on one corpus\ncan be transferred to other settings. 
In this paper, we focus on corpus-level adaptation methods which\nquantify the structural similarity of two vector space embeddings each learned on a separate corpus.\nConsider a motivating example of training conversational intent and entity classifiers for computer\nsoftware diagnosis. While many pre-trained word embeddings are available for such types of natural\nlanguage problems, most of these embeddings are trained on general corpora such as news collections\nor Wikipedia. As previously mentioned, linguistic shifts can result in semantic differences between\nthe domain on which the embeddings were trained and the domain in which the embeddings are\nbeing used. Empirically, such variations can significantly affect the performance of models using\nembeddings not trained on the target domain, especially when training data is sparse. As a result, it is\nimportant, both practically and theoretically, to quantify the dissimilarity between target and source\ndomains as well as to study the root cause of this phenomenon: language variation across time and domain.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nCurrent distributional approaches for corpus-level adaptation are alignment-based. Consider two\ncorpora E and F with corresponding vector embedding matrices E and F \u2208 R^{n\u00d7d}, where d is the\ndimension of the embeddings and n is the size of the common vocabulary. Using the observation\nthat vector embeddings are equivalent up to a unitary transformation [11, 12, 33], alignment-based\napproaches find a unitary operator Q\u2217 = argmin_{Q\u2208O(d)} \u2016E \u2212 FQ\u2016_F, where O(d) is the group of d \u00d7 d\nunitary matrices and \u2016\u00b7\u2016_F is the Frobenius norm. The shift in the meaning of an individual word can\nbe measured by computing the norm of the difference of the corresponding rows of E and FQ\u2217. 
The\ndifference in language usage between the corpora is then quantified as \u2016E \u2212 FQ\u2217\u2016_F. In the rest of\nthe paper, all matrix norms will be assumed to be the Frobenius norm unless otherwise specified.\nOn the other hand, anchor-based approaches [10, 17, 18] are primarily used as a local method for\ndetecting word-level adaptations. In the local anchor method, a set of words appearing in both corpora\nis picked as \"anchors\" against which the particular word is compared. If the relative position of\nthe word\u2019s embedding to the anchors has shifted significantly between the two embeddings, the\nmeaning of the word is likely to be different. The anchor words are usually hand selected to reflect\nword meaning shift along a specific direction. For example, Bolukbasi et al. [3]\nselected gender-related anchors to detect shifts in gender bias. However, the local nature and the need\nfor anchors to be picked by hand or by nearest neighbor search make the local anchoring method\nunsuitable for detecting corpus-level shifts.\nThe three major contributions of our work are:\n\n1. Proposing the global anchor method, a generalization of the local anchor method for\ndetecting corpus-level adaptation.\n\n2. Establishing a theoretical equivalence of the alignment and global anchor methods in terms\nof detection ability of corpus-level language adaptation. Meanwhile, we find that the global\nanchor method has practical advantages in terms of implementation and applicability.\n\n3. 
Demonstrating that, when combined with spectral methods, the anchor method is capable\nof revealing fine details of language evolution and linguistic affinities between disjoint\ncommunities.\n\n2 Related Work\n\nThe study of domain adaptation of natural language, such as diachronic shifts, is an active research\nfield, with word- and corpus-level adaptation constituting the two main topics.\n\n2.1 Word-level Adaptation\n\nNon-Distributional Approaches. Word-level adaptation methods quantify the semantic and\nsyntactic shift of individual words in different text corpora such as those from disparate communities\nor time periods. Graph-based methods [5] such as Markov clustering have been used to identify\nmultiple word senses in varying contexts and are useful for resolving ambiguity related to polysemous\nwords, that is, words which have multiple meanings. Topic modeling algorithms, such as the Hierarchical\nDirichlet Process (HDP) [19], have also been applied to learn variations in word sense usage. Word\nsense induction methods are valuable for understanding word-level adaptation because some word\nsenses occur more or less frequently across different domains (corpora). For instance, consider\nthe word \u201carms\u201d which can either mean body parts or weapons. A medical corpus may have a\nhigher relative frequency of the former sense when compared to a news corpus. Frequency statistics,\nwhich use relative word counts, have been used to predict the rate of lexical replacement in various\nIndo-European languages over time [26, 28], where more common words are shown to evolve or be\nreplaced at a slower rate than those less frequently used.\nDistributional Approaches. Distributional methods for word-level shifts use second order statistics,\nor word co-occurrence distributions, to characterize semantic and syntactic shifts of individual words\nin different corpora. 
Distributional methods have been used to determine whether different senses for\na word have been introduced, removed, or split by studying differences in co-occurring words across\ncorpora from disparate time-periods [14, 25]. Vector space embedding models, such as Word2Vec\n[24], learn vector representations in the Euclidean space. After training embeddings on different\ncorpora, such as Google Books for disjoint time periods, one can compare the nearest neighbors of a\nparticular word in different embedding spaces to detect semantic variations [8, 16, 11, 31]. When the\nnearest neighbors are different in these embedding spaces for a particular word, it is likely that the\nmeaning of the word is different across the two corpora. The anchoring approach\nextends this idea by selecting the union of a word\u2019s nearest neighbors in each embedding space as the\nset of anchors and is used to detect word-level linguistic shifts due to cultural factors [10]. Anchoring\nmethods have also been used to compare word embeddings learned from diachronic corpora such\nas periods of war using a supervised selection of \"conflict-specific\" anchor words [17, 18].\n\n2.2 Corpus-level Adaptation\n\nIn contrast to word-level adaptation, corpus-level adaptation methods are used to compute the\nsemantic similarity of natural language corpora. Non-distributional methods such as Jensen-Shannon\nDivergence (JSD) have been applied to count statistics and t-SNE embeddings to study the linguistic\nvariations in the Google Books corpus over time [27, 35].\nAlignment-based distributional methods make use of the observation that vector space embeddings\nare rotation invariant and as a result are equivalent up to a unitary transformation [11, 33]. 
Alignment\nmethods, which learn a unitary transform between two sets of word embeddings, use the residual\nloss of the alignment objective to quantify the linguistic dissimilarity between the corpora on which\nthe embeddings were trained. In the context of multi-lingual corpora, Mikolov et al. [23] find that\nthe alignment method works as well as neural network-based methods for aligning two embedding\nspaces trained on corpora from different languages. Furthermore, algorithms for jointly training word\nembeddings from diachronic corpora have been researched to discover and regularize corpus-level\nshifts due to temporal factors [30, 36]. In the context of diachronic word shifts, Hamilton et al. [11]\nalign word embeddings trained on diachronic corpora using the alignment method. In Hamilton et al.\n[10], anchoring is proposed specifically as a word-level \"local\" method while alignment is used to\ncapture corpus-level \"global\" shifts. Similar concepts are used in tensor-based schemes [39, 40] and\nrecommendation systems based on deep-learning [15, 41].\n\n3 Global Anchor Method for Detecting Corpus-Level Language Shifts\nGiven two corpora E and F, we ask the fundamental question of how different they are in terms of\nlanguage usage. Various factors contribute to the differences, for example, chronology or community\nvariations. Let E, F be two separate word embeddings trained on E and F and consisting of common\nvocabulary. As a recap, the alignment method finds an orthogonal matrix Q\u2217 which minimizes\n\u2016E \u2212 FQ\u2016, and the residual \u2016E \u2212 FQ\u2217\u2016 is the dissimilarity between the two corpora.\nWe propose the global anchor method, a generalization of the local anchor method for detecting\ncorpus-level adaptation. 
We first introduce the local anchor method for word-level adaptation detection, upon\nwhich our global method is constructed.\n\n3.1 Local Anchor Method for Word-Level Adaptation Detection\n\nThe shift of a word\u2019s meaning can be revealed by comparing it against a set of anchor words [18, 17],\nwhich is a direct result of the distributional hypothesis [10, 13, 6]. Specifically, let {1, \u00b7\u00b7\u00b7, l} be the\nindices of the l anchor words, common to two different corpora. To measure how much the meaning\nof a word i has shifted between the two corpora, one triangulates it against the l anchors in the two\nembedding spaces by calculating the inner products of word i\u2019s vector representation with those of\nthe l anchors. Since the embedding for word i is the i-th row of the embedding matrix, this procedure\nproduces two length-l vectors, namely\n\n$$(\\langle E_{i,\\cdot}, E_{1,\\cdot}\\rangle, \\cdots, \\langle E_{i,\\cdot}, E_{l,\\cdot}\\rangle) \\quad \\text{and} \\quad (\\langle F_{i,\\cdot}, F_{1,\\cdot}\\rangle, \\cdots, \\langle F_{i,\\cdot}, F_{l,\\cdot}\\rangle).$$\n\nThe norm of the difference of these two vectors,\n\n$$\\left\\| (\\langle E_{i,\\cdot}, E_{1,\\cdot}\\rangle, \\cdots, \\langle E_{i,\\cdot}, E_{l,\\cdot}\\rangle) - (\\langle F_{i,\\cdot}, F_{1,\\cdot}\\rangle, \\cdots, \\langle F_{i,\\cdot}, F_{l,\\cdot}\\rangle) \\right\\|,$$\n\nreflects the drift of word i with respect to the l anchor words. The anchors are usually selected as\na set of pre-defined words in a supervised fashion or by a nearest neighbor search, to reflect shifts\nalong a specific direction [3, 17] or a local neighborhood [10].\n\n3.2 The Global Anchor Method\n\nTwo issues arise from the local anchor method for corpus-level adaptation, namely its local nature\nand the need for anchors to be hand-picked or selected using nearest neighbors. We address them by\nintroducing the global anchor method, a generalization of the local approach. 
In the global anchor\nmethod, we use all the words in the common vocabulary as anchors, which brings two benefits. First,\nhuman supervision is no longer needed as anchors are no longer hand-picked. Second, the anchor set\nis enriched so that shift detections are no longer restricted to one direction. These two benefits make\nthe global anchor method suitable for detecting corpus-level adaptation. In the global anchor method,\nthe expression for the corpus-level dissimilarity simplifies to:\n\n$$\\|EE^T - FF^T\\|.$$\n\nConsider the i-th row of EE^T and FF^T respectively. (EE^T)_{i,\u00b7} = (\u27e8E_{i,\u00b7}, E_{1,\u00b7}\u27e9, \u00b7\u00b7\u00b7, \u27e8E_{i,\u00b7}, E_{n,\u00b7}\u27e9),\nwhich measures the i-th vector E_i using all other vectors as anchors. The same is true for (FF^T)_{i,\u00b7}.\nThe norm of the difference, \u2016(EE^T)_{i,\u00b7} \u2212 (FF^T)_{i,\u00b7}\u2016, measures the relative shift of word i across the\ntwo embeddings. If this distance is large, it is likely that the meaning of the i-th word is different\nin the two corpora. This leads to an embedding distance metric also known as the Pairwise Inner\nProduct loss [37, 38].\n\n4 The Alignment and Global Anchor Methods: Equivalent Detection of Linguistic Shifts\n\nBoth the alignment and global anchor methods provide metrics for corpus dissimilarity. We prove in\nthis section that the metrics which the two methods produce are equivalent. The proof is based on the\nisotropy observation of vector embeddings [1] and projection geometry. Recall from real analysis\n[29] that two metrics d_1 and d_2 are equivalent if there exist positive c_1 and c_2 such that:\n\n$$c_1 d_1(x, y) \\le d_2(x, y) \\le c_2 d_1(x, y), \\quad \\forall x, y.$$\n\n4.1 The Isotropy of Word Vectors\n\nWe show that the columns of embedding matrices are approximately orthonormal, which arises\nnaturally from the isotropy of word embeddings [1]. 
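As an illustration of the global anchor distance from Section 3.2, the following is a minimal NumPy sketch, not the authors' released code. The function names are hypothetical, the random matrices stand in for real embeddings, and the low-memory variant relies on a standard trace identity that the paper does not state.

```python
import numpy as np

def global_anchor_distance(E, F):
    # Direct form: build both n x n Gram matrices, then take the Frobenius norm.
    return np.linalg.norm(E @ E.T - F @ F.T)

def global_anchor_distance_lowmem(E, F):
    # Same value without forming any n x n matrix, using the identity
    # ||EE^T - FF^T||^2 = ||E^T E||^2 + ||F^T F||^2 - 2 ||E^T F||^2
    # (an assumption of this sketch, not taken from the paper).
    sq = (np.linalg.norm(E.T @ E) ** 2
          + np.linalg.norm(F.T @ F) ** 2
          - 2.0 * np.linalg.norm(E.T @ F) ** 2)
    return np.sqrt(max(sq, 0.0))

def word_shifts(E, F):
    # Row-wise norms of EE^T - FF^T: the relative shift of each word i.
    return np.linalg.norm(E @ E.T - F @ F.T, axis=1)

# Stand-in data: same vocabulary (n = 200), different dimensionalities.
rng = np.random.default_rng(0)
E = rng.standard_normal((200, 16))
F = rng.standard_normal((200, 24))
d_naive = global_anchor_distance(E, F)
d_lowmem = global_anchor_distance_lowmem(E, F)
shifts = word_shifts(E, F)
```

Because only inner products between rows are compared, E and F may have different numbers of columns, which the alignment objective does not allow.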
The isotropy requires the distribution of the\nvectors to be uniform along all directions. This implies E_w/\u2016E_w\u2016 follows a uniform distribution on\na sphere, which is equivalent in distribution to the case when E_w has i.i.d., zero-mean normal entries\n[21]. Under this assumption, we invoke a result by Bai and Yin [2]:\nTheorem 1. Suppose the entries of X \u2208 R^{n\u00d7d}, d/n = p \u2208 (0, 1), are random i.i.d. with zero mean,\nunit variance, and finite 4th moment. Let \u03bb_min and \u03bb_max be the smallest and largest singular values\nof X^T X/n, respectively. Then:\n\n$$\\lim_{n\\to\\infty} \\lambda_{\\min} \\overset{\\text{a.c.}}{=} (1 - \\sqrt{p})^2, \\qquad \\lim_{n\\to\\infty} \\lambda_{\\max} \\overset{\\text{a.c.}}{=} (1 + \\sqrt{p})^2.$$\n\nThis shows that the smallest and largest singular values for the embedding matrix, under the i.i.d.\nassumption with entry standard deviation \u03c3, are asymptotically \u221an \u03c3(1 \u2212 \u221a(d/n)) and \u221an \u03c3(1 + \u221a(d/n)) respectively. Further, notice\nthe dimensionality d, usually in the order of hundreds, is much smaller than the vocabulary size n,\nwhich can be tens of thousands up to millions [24]. This leads to the result that the singular values of\nthe embedding matrix should be tightly clustered, which is empirically verified by Arora et al. [1]. In\nother words, the columns of E and F are close to orthonormal.\n\n4.2 The Equivalence of the Global Anchor Method and the Alignment Method\n\nThe orthonormality of the columns of E and F means they can be viewed as the basis for the\nsubspaces they span. Lemma 2 is a classical result regarding the principal angles [7] between\nsubspaces. Using the lemma, we prove Theorems 3 and 4, and Corollary 4.1.\n\nLemma 2. Suppose E \u2208 R^{n\u00d7d}, F \u2208 R^{n\u00d7d} are two matrices with orthonormal columns. Then:\n\n1. SVD(E^T F) = U C V^T, where C_i = cos(\u03b8_i) is the cosine of the ith principal angle between\nsubspaces spanned by the columns of E and F.\n\n2. 
SVD(E\u22a5^T F) = \u0168 S V^T, where S_i = sin(\u03b8_i) is the sine of the ith principal angle between\nsubspaces spanned by the columns of E and F, and where E\u22a5 \u2208 R^{n\u00d7(n\u2212d)} is an orthonormal\nbasis for the orthogonal complement of E\u2019s column space.\n\nLet \u0398 = (\u03b8_1, \u00b7\u00b7\u00b7, \u03b8_d) be the vector of principal angles between the subspaces spanned by E and F,\nand all operations on \u0398, such as sin and raising to a power, be applied element-wise.\nTheorem 3. The metric for the alignment method, \u2016E \u2212 FQ\u2217\u2016, equals 2\u2016sin(\u0398/2)\u2016.\n\nProof. Note that\n\n$$(E - FQ)(E - FQ)^T = EE^T + FF^T - EQ^TF^T - FQE^T.$$\n\nWe perform a change of basis into the columns of [E E\u22a5]:\n\n$$\\begin{aligned} (E - FQ)(E - FQ)^T &= [E \\;\\; E_\\perp] \\left( \\begin{bmatrix} I & 0 \\\\ 0 & 0 \\end{bmatrix} + \\begin{bmatrix} E^TFF^TE & E^TFF^TE_\\perp \\\\ E_\\perp^TFF^TE & E_\\perp^TFF^TE_\\perp \\end{bmatrix} - \\begin{bmatrix} E^TFQ & 0 \\\\ E_\\perp^TFQ & 0 \\end{bmatrix} - \\begin{bmatrix} Q^TF^TE & Q^TF^TE_\\perp \\\\ 0 & 0 \\end{bmatrix} \\right) \\begin{bmatrix} E^T \\\\ E_\\perp^T \\end{bmatrix} \\\\ &= [E \\;\\; E_\\perp] \\left( \\begin{bmatrix} I & 0 \\\\ 0 & 0 \\end{bmatrix} + \\begin{bmatrix} UC^2U^T & UCS\\tilde{U}^T \\\\ \\tilde{U}CSU^T & \\tilde{U}S^2\\tilde{U}^T \\end{bmatrix} - \\begin{bmatrix} UCV^TQ & 0 \\\\ \\tilde{U}SV^TQ & 0 \\end{bmatrix} - \\begin{bmatrix} Q^TVCU^T & Q^TVS\\tilde{U}^T \\\\ 0 & 0 \\end{bmatrix} \\right) \\begin{bmatrix} E^T \\\\ E_\\perp^T \\end{bmatrix} \\qquad (1) \\end{aligned}$$\n\nNotice that the Q\u2217 minimizing \u2016E \u2212 FQ\u2016 equals Q\u2217 = V U^T [32]. Plug in Q\u2217 to (1) and we get\n\n$$\\begin{aligned} (E - FQ^*)(E - FQ^*)^T &= [E \\;\\; E_\\perp] \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix} \\begin{bmatrix} I + C^2 - 2C & CS - S \\\\ CS - S & S^2 \\end{bmatrix} \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix}^T \\begin{bmatrix} E^T \\\\ E_\\perp^T \\end{bmatrix} \\\\ &= [E \\;\\; E_\\perp] \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix} \\begin{bmatrix} (I - C)^2 & -S(I - C) \\\\ -S(I - C) & S^2 \\end{bmatrix} \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix}^T \\begin{bmatrix} E^T \\\\ E_\\perp^T \\end{bmatrix} \\qquad (2) \\end{aligned}$$\n\nBy applying the trigonometric identities 1 \u2212 cos(\u03b8) = 2 sin^2(\u03b8/2) and sin(\u03b8) = 2 sin(\u03b8/2) cos(\u03b8/2)\nto equation (2), we have\n\n$$(I - C)^2 = 4\\sin^4(\\Theta/2), \\qquad -S(I - C) = -4\\sin^3(\\Theta/2)\\cos(\\Theta/2), \\qquad S^2 = 4\\sin^2(\\Theta/2)\\cos^2(\\Theta/2).$$\n\nPlug in the quantities into (2),\n\n$$(E - FQ^*)(E - FQ^*)^T = [E \\;\\; E_\\perp] \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix} \\begin{bmatrix} \\sin(\\Theta/2) \\\\ -\\cos(\\Theta/2) \\end{bmatrix} 4\\sin^2(\\Theta/2) \\begin{bmatrix} \\sin(\\Theta/2) \\\\ -\\cos(\\Theta/2) \\end{bmatrix}^T \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix}^T \\begin{bmatrix} E^T \\\\ E_\\perp^T \\end{bmatrix}.$$\n\nAs a result, the singular values of E \u2212 FQ\u2217 are 2 sin(\u0398/2). So \u2016E \u2212 FQ\u2217\u2016 = 2\u2016sin(\u0398/2)\u2016.\nTheorem 4. The metric for the global anchor method, \u2016EE^T \u2212 FF^T\u2016, equals \u221a2 \u2016sin \u0398\u2016.\nProof. First, notice [E E\u22a5] forms a unitary matrix of R^n. 
Also note the Frobenius norm is unitary-invariant. The above observations allow us to perform a change of basis:\n\n$$\\begin{aligned} \\|EE^T - FF^T\\| &= \\left\\| \\begin{bmatrix} E^T \\\\ E_\\perp^T \\end{bmatrix} (EE^T - FF^T) \\, [E \\;\\; E_\\perp] \\right\\| \\\\ &= \\left\\| \\begin{bmatrix} I & 0 \\\\ 0 & 0 \\end{bmatrix} - \\begin{bmatrix} E^TFF^TE & E^TFF^TE_\\perp \\\\ E_\\perp^TFF^TE & E_\\perp^TFF^TE_\\perp \\end{bmatrix} \\right\\| \\\\ &= \\left\\| \\begin{bmatrix} I & 0 \\\\ 0 & 0 \\end{bmatrix} - \\begin{bmatrix} UC^2U^T & UCS\\tilde{U}^T \\\\ \\tilde{U}CSU^T & \\tilde{U}S^2\\tilde{U}^T \\end{bmatrix} \\right\\| \\\\ &= \\left\\| \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix} \\begin{bmatrix} I - C^2 & -CS \\\\ -CS & -S^2 \\end{bmatrix} \\begin{bmatrix} U & 0 \\\\ 0 & \\tilde{U} \\end{bmatrix}^T \\right\\| \\\\ &= \\left\\| \\begin{bmatrix} S^2 & -CS \\\\ -CS & -S^2 \\end{bmatrix} \\right\\| = \\left\\| \\begin{bmatrix} S & 0 \\\\ 0 & S \\end{bmatrix} \\begin{bmatrix} S & -C \\\\ -C & -S \\end{bmatrix} \\right\\| = \\left\\| \\begin{bmatrix} S & 0 \\\\ 0 & S \\end{bmatrix} \\right\\| \\\\ &= \\sqrt{2}\\,\\|S\\| = \\sqrt{2}\\,\\|\\sin\\Theta\\|. \\end{aligned}$$\n\nCorollary 4.1. \u2016E \u2212 FQ\u2217\u2016 \u2264 \u2016EE^T \u2212 FF^T\u2016 \u2264 \u221a2 \u2016E \u2212 FQ\u2217\u2016.\nProof. By Theorems 3 and 4, \u2016EE^T \u2212 FF^T\u2016 = \u221a2 \u2016sin \u0398\u2016 and min_{Q\u2208O(d)} \u2016E \u2212 FQ\u2016 = 2\u2016sin(\u0398/2)\u2016. Finally, the corollary can be obtained since\n\n$$\\sqrt{2}\\sin(\\theta/2) \\le \\sin(\\theta) \\le 2\\sin(\\theta/2),$$\n\nwhich is a result of \u221a2 \u2264 2 cos(\u03b8/2) \u2264 2 for \u03b8 \u2208 [0, \u03c0/2].\n\n4.3 Validation of the Equivalence between Alignment and Global Anchor Methods\n\nWe proved that the anchor and alignment methods are equivalent in detecting linguistic variations\nfor two corpora up to at most a constant factor of \u221a2/2 under the isotropy assumption. To empirically\nverify that the equivalence holds, we conduct the following experiment. Let E^{(i)} and E^{(j)}\ncorrespond to word embeddings trained on the Google Books dataset for distinct years i and j in\n{1900, 1901, \u00b7\u00b7\u00b7, 2000} respectively (footnote 1). Normalize E^{(i)} and E^{(j)} by their average column-wise norm,\n$\\frac{1}{d}\\sum_{k=1}^{d} \\|E_{[\\cdot,k]}\\|$, so the embedding matrices have the same Frobenius norm. For every such pair\n(i, j), we compute\n\n$$\\frac{\\min_{Q \\in O(d)} \\|E^{(i)} - E^{(j)}Q\\|}{\\|E^{(i)}E^{(i)T} - E^{(j)}E^{(j)T}\\|}.$$\n\nOur theoretical analysis showed that this number is between \u221a2/2 \u2248 0.707 and 1. Since we evaluate\nfor every possible pair of years, there are in total 10,000 such ratios. The statistics are summarized in\nTable 1. The empirical results indeed match the theoretical analysis. Not only are the ratios within the\nrange [\u221a2/2, 1], but they are also tightly clustered around 0.83, meaning that empirically the output\nof the alignment method is approximately a constant multiple of that of the global anchor method.\n\nTable 1: The Ratio of Distances Given by the Alignment and Global Anchor Methods\n\n       | mean  | std   | min.  | max.  | theo. min.         | theo. max.\nRatio  | 0.826 | 0.015 | 0.774 | 0.855 | \u221a2/2 \u2248 0.707 | 1\n\n4.4 Advantages of the Global Anchor Method over the Alignment Method\nTheorems 3, 4 and Corollary 4.1 together establish the equivalence of the anchor and alignment\nmethods in identifying corpus-level language shifts using word embeddings; the methods differ\nby at most a constant factor of \u221a2. Despite the theoretical equivalence, there are several practical\ndifferences to consider. We briefly discuss some of these differences.\n\u2022 Applicability: The alignment method can be applied only to embeddings of the same\ndimensionality, since the orthogonal transformation it uses is an isometry onto the original\nR^d space. On the other hand, the global anchor method can be used to compare embeddings\nof different dimensionalities.\n\n\u2022 Implementation: The global anchor method involves only matrix multiplications, making\nit convenient to code. It is also easy to parallelize, as the matrix product and entry-wise\ndifferences are naturally parallelizable and a map-reduce implementation is straightforward.\nThe alignment method, on the other hand, requires solving a constrained optimization\nproblem. This problem can be solved using gradient [36] or SVD [11] methods.\n\nDue to the benefits in implementation and applicability, along with the established equivalence of\nthe global anchoring and alignment methods, the global anchor method should be preferred for\nquantifying corpus-level linguistic shifts. In the next section, we conduct experiments using the\nglobal anchor method to detect linguistic shifts in corpora which vary in time and domain.\n\n5 Detecting Linguistic Shifts with the Global Anchor Method\nThe global anchor method can be used to discover language evolution and domain-specific language\nshifts.\nIn the first experiment, we demonstrate that the global anchor method discovers the\nfine-grained evolutionary trajectory of language, using the Google Books N-gram data [20]. 
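As a numerical illustration of Theorems 3 and 4 and Corollary 4.1 from Section 4, the sketch below compares the two metrics on random matrices with orthonormal columns. It is a toy check under the exact orthonormality assumption of Lemma 2, not a reproduction of the Google Books experiment; the sizes n and d and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 10

# Random matrices with orthonormal columns (the assumption of Lemma 2).
E, _ = np.linalg.qr(rng.standard_normal((n, d)))
F, _ = np.linalg.qr(rng.standard_normal((n, d)))

# The singular values of E^T F are the cosines of the principal angles.
U, C, Vt = np.linalg.svd(E.T @ F)
theta = np.arccos(np.clip(C, -1.0, 1.0))

# Orthogonal Procrustes solution Q* = V U^T, then both metrics.
Q = Vt.T @ U.T
alignment = np.linalg.norm(E - F @ Q)
anchor = np.linalg.norm(E @ E.T - F @ F.T)

# Closed forms from Theorem 3 and Theorem 4.
alignment_theory = 2.0 * np.linalg.norm(np.sin(theta / 2.0))
anchor_theory = np.sqrt(2.0) * np.linalg.norm(np.sin(theta))
ratio = alignment / anchor
```

Under these assumptions the ratio should fall between the theoretical bounds discussed in Section 4.3; trained embeddings only satisfy orthonormality approximately, so the agreement there is empirical rather than exact.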
The\nGoogle Books dataset is a collection of digitized publications between 1800 and 2009, which makes\nup roughly 6% of the total number of books ever published. The data is presented in n-gram format,\nwhere n ranges from 1 to 5. We collected the n-gram data for English fiction between 1900 and\n2000, trained skip-gram Word2Vec models for each year, and compared the distance between\nembeddings of different years using the global anchor method.\n\n1 Details of the dataset and training are discussed in Section 5.\n\nIn our second experiment, we show that the global anchor method can be used to find community-\nbased linguistic affinities of text from varying academic fields and categories on arXiv, an online\npre-print service widely used in the fields of computer science, mathematics, and physics. We\ncollected all available arXiv LaTeX files submitted to 50 different academic communities between\nJanuary 2007 and December 2017, resulting in corpora assembled from approximately 75,000\nacademic papers, each associated with a single primary academic field and category. After parsing\nthese LaTeX files into natural language, we constructed disjoint corpora and trained skip-gram\nWord2Vec models for each category. We then computed the anchor loss for each pair of categories.\n\nWe conduct two more experiments on Reddit community (subreddit) comments as well as the Corpus\nof Historical American English (COHA); these experiments along with further analysis on word-level\nlinguistic shifts are deferred to the Appendix due to space constraints. Our code and datasets are\npublicly available on GitHub (footnote 2).\n5.1 Language Evolution\nThe global anchor method can reveal the evolution of language, since it provides a quantitative metric\n\u2016EE^T \u2212 FF^T\u2016 between corpora from different years. 
Figure 1 is a visualization of the anchor\ndistance between different embeddings trained on the Google Books n-gram dataset, where the ijth\nentry is the anchor distance between E(i) and E(j). In Figure 1a, we grouped n-gram counts and\ntrained embeddings for every decade, and in Figure 1b the embeddings are trained for every year.\nFirst, we observe that there is a banded structure. Linguistic variation increases with respect to |i\u2212 j|,\nwhich is expected. The banded structure was also observed by Pechenick et al. [27] who used word\nfrequency methods instead of distributional approaches. Languages do not evolve at constant speed\nand the results of major events can have deep effects on the evolution of natural language. Due to the\n\ufb01ner structure of word embeddings, compared to \ufb01rst-order statistics like frequencies, the anchor\nmethod captures more than just banded structure. An example is the effect of wars. In Figure 1b, we\nsee that embeddings trained on years between 1940-1945 have greater anchor loss when compared to\nembeddings from 1920-1925 than those from 1915-1918. Figure 2 demonstrates the row vector in\nFigure 1b for the year 1944.\n\n(a) Anchor difference across decades\n\n(b) Anchor difference across years\n\nFigure 1: Temporal evolution of English language and the banded structure\n\nIn Figure 2, there is a clear upward trend of the anchor difference as one moves away from 1944.\nHowever, there is a major dip around 1915-1918 (WWI), and two minor dips around 1959 (Korean\nWar) and 1967 (Vietnam War). This pattern is consistent across 1939-1945 (WWII) for the anchor\nmethods, but not as clear when using frequency methods. As per the distributional hypothesis [6],\none should consider that frequency methods, unlike co-occurrence approaches, do not capture the\nsemantics of words but rather the relative frequency of their usage. 
As discussed in Pechenick et al. [27], frequency changes of popular words (his, her, which, etc.) contribute the most to the frequency\ndiscrepancies. This, however, does not mean the two corpora are linguistically different, as these\npopular words may retain their meaning and could be used in the same contexts, despite frequency\ndifferences. The global anchor method is less sensitive to this type of artifact as it captures change of\nword meaning rather than frequency, and as a result is able to show finer structures of language shifts.\n\n2 https://github.com/ziyin-dl/global-anchor-method\n\nFigure 2: Anchor difference for year 1944, note the dips during war times (curves: original PIP loss between 1944 and other years; smoothed PIP loss)\n\n5.2 Trajectory of Language Evolution\nAs discussed in Section 5.1, the global anchor method can provide finer structure about the rate of\nevolution compared to frequency-based approaches. The distance matrix provided by the anchor\nmethod can further give information about the direction of evolution via the graph Laplacian\ntechnique [34]. 
The graph Laplacian method looks for points in a low-dimensional space where\nthe distance between the pair (i, j) reflects the corresponding entry of the anchor loss matrix.\nAlgorithm 1 describes the procedure for obtaining Laplacian Embeddings from the anchor loss matrix.\n\nAlgorithm 1 Laplacian Embedding for Distance Matrix\n1: Given a distance matrix M;\n2: Let S = exp(\u2212M / (2\u03c3^2)) be the exponentiated similarity matrix;\n3: Calculate the Laplacian L = I \u2212 D^{\u22121/2} S D^{\u22121/2}, where D = diag(d) and d_i = \u2211_j S_ij;\n4: Compute the singular value decomposition U D V^T = L;\n5: Take the last k columns of U, U_{\u00b7, n\u2212k:n}, as the dimension k embedding of M.\n\nIn Figure 3a, we show the 2-dimensional embedding of the anchor distance matrix for Google Books\nn-gram (English Fiction) embeddings from 1900 to 2000. We can see that the years follow\na trajectory starting from the bottom-left and gradually ending at the top-right. There are a few\nnoticeable deviations on this trajectory, specifically the years 1914-1918, 1938-1945 and 1981-2000.\nThe first two periods were major wartimes, and these two deviations closely resemble\neach other, indicating that they are driven by the same type of event. The last deviation is due to the rise\nof scientific literature, where a significant amount of technical terminology (e.g. computer) was\nintroduced starting from the 1980s. This was identified as a major bias of the Google Books dataset [27].\n5.3 Linguistic Variation in Academic Subjects\nIn Figure 3b, we use the global anchor method to detect linguistic similarity of arXiv papers from\ndifferent academic communities. We downloaded and parsed the LaTeX files posted on arXiv between\nJan. 2007 and Dec. 2017, and trained embeddings for each academic category using text from the\ncorresponding papers. The anchor distance matrix is deferred to the appendix due to page limits.
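Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' released implementation: the function name `laplacian_embedding`, the entrywise reading of the exponential in step 2, and the default bandwidth (sigma^2 set to the median off-diagonal distance) are choices made here, since the text does not say how sigma is selected.

```python
import numpy as np


def laplacian_embedding(M, k=2, sigma=None):
    """Sketch of Algorithm 1: embed the n items of a distance matrix M in k dimensions."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    if sigma is None:
        # Assumption (not from the paper): set sigma^2 to the median
        # off-diagonal distance so that the similarities do not underflow.
        off_diagonal = M[~np.eye(n, dtype=bool)]
        sigma = np.sqrt(np.median(off_diagonal))
    # Step 2: exponentiated similarity matrix, applied entrywise.
    S = np.exp(-M / (2.0 * sigma ** 2))
    # Step 3: Laplacian L = I - D^{-1/2} S D^{-1/2}, with d_i = sum_j S_ij.
    d = S.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    L = np.eye(n) - inv_sqrt_d[:, None] * S * inv_sqrt_d[None, :]
    # Step 4: singular value decomposition (singular values are returned
    # in descending order).
    U, _, _ = np.linalg.svd(L)
    # Step 5: the last k columns of U, i.e. those with the smallest singular values.
    return U[:, n - k:]


if __name__ == "__main__":
    # Toy distance matrix for six "years" on a line: M_ij = |i - j|.
    years = np.arange(6)
    M = np.abs(years[:, None] - years[None, :])
    print(laplacian_embedding(M, k=2))
```

Followed literally, the last of the returned columns spans the null space of L (it is proportional to the square root of the degree vector d), so for a plot one may prefer to look at the columns just before it; that choice is left to the reader.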
We\napplied the Laplacian Embedding, Algorithm 1, to the anchor distance matrix, and obtained spectral\nembeddings for the different categories. The categories are generally clustered\naccording to their fields: the math, physics and computer science categories each form their own cluster.\nAdditionally, the global anchor method revealed a few exceptions that make sense on second glance:\n\n\u2022 Statistical Mechanics (cond-mat.stat-mech) is closer to math and computer science categories\n\u2022 History and Overview of Mathematics (math.HO) is far away from other math categories\n\u2022 Information theory (cs.IT) is closer to math topics than other computer science categories\n\n[Figure 2 legend: original PIP loss between 1944 and other years; smoothed PIP loss]\n\n(a) Anchor difference across years of N-gram corpus\n\n(b) Anchor difference across ArXiv topics\n\nFigure 3: 2-D embedding of corpora reveals evolution trajectory and domain similarity\n\n6 Conclusion and Future Work\nIn this paper, we introduced the global anchor method for detecting corpus-level linguistic\nshifts. We showed both theoretically and empirically that the global anchor method provides\nan equivalent metric to the alignment method, a widely used method for corpus-level shift\ndetection. Meanwhile, the global anchor method excels in applicability and implementation. We\ndemonstrated that the global anchor method can be used to capture linguistic shifts caused by\ntime and domain. It is able to reveal finer structures than frequency-based approaches,\nsuch as linguistic variations caused by wars and linguistic similarities between academic communities.\n\nWe demonstrated in Section 5 important applications of the global anchor method in detecting\ndiachronic and domain-specific linguistic shifts using word embeddings.
As embedding models are\nfoundational tools in Deep Learning, the global anchor method can be used to address the problems of\nTransfer Learning and Domain Adaptation, which are ubiquitous in NLP and Information Retrieval.\nIn these fields, Transfer Learning is important as it attempts to use models learned from a source\ndomain effectively in different target domains, potentially with much smaller amounts of data. The\nefficacy of model transfer depends critically on the domain dissimilarity, which is what our method\nquantifies.\n\nWhile we mainly discuss corpus-level adaptation in this paper, future work includes using the anchor\nmethod to discover global trends and patterns in different corpora, which lie between corpus-level and\nword-level linguistic shifts. In particular, unsupervised methods for selecting anchors are of great\ninterest.\n\n[Figure 3b point labels: cond-mat.dis-nn, cond-mat.mes-hall, cond-mat.mtrl-sci, cond-mat.soft, cond-mat.stat-mech, cond-mat.str-el, cond-mat.supr-con, cs.AI, cs.CC, cs.CR, cs.DM, cs.DS, cs.GT, cs.IT, cs.LO, math.AC, math.AG, math.AP, math.AT, math.CA, math.CO, math.CT, math.CV, math.DG, math.DS, math.FA, math.GM, math.GN, math.GR, math.GT, math.HO, math.KT, math.LO, math.MG, math.NA, math.NT, math.OA, math.OC, math.PR, math.QA, math.RA, math.RT, math.SG, math.SP, math.ST, physics.atom-ph, physics.class-ph, physics.gen-ph, physics.hist-ph, physics.plasm-ph]\n\n7 Acknowledgements\nThe authors would like to thank Professors Dan Jurafsky and Will Hamilton for their helpful comments\nand discussions. Additionally, we thank the anonymous reviewers for their feedback during the\nreview process.\n\nReferences\n[1] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable\nmodel approach to pmi-based word embeddings. Transactions of the Association for\nComputational Linguistics (TACL), 4:385\u2013399, 2016.\n\n[2] ZD Bai and YQ Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance\nmatrix. In Advances In Statistics, pages 108\u2013127.
World Scientific, 2008.\n\n[3] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.\nMan is to computer programmer as woman is to homemaker? debiasing word embeddings. In\nAdvances in Neural Information Processing Systems, pages 4349\u20134357, 2016.\n\n[4] Mark Davies. Corpus of historical american english (coha). 2015. doi: 10.7910/DVN/8SRSYK.\nURL https://doi.org/10.7910/DVN/8SRSYK.\n\n[5] Beate Dorow and Dominic Widdows. Discovering corpus-specific word senses. In Proceedings\nof the tenth conference on European chapter of the Association for Computational Linguistics\n(EACL), pages 79\u201382, 2003.\n\n[6] John R Firth. A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis, 1957.\n\n[7] Aur\u00e9l Gal\u00e1ntai and CS J Heged\u0171s. Jordan\u2019s principal angles in complex vector spaces. Numerical\nLinear Algebra with Applications, 13(7):589\u2013598, 2006.\n\n[8] Kristina Gulordava and Marco Baroni. A distributional similarity approach to the detection\nof semantic change in the google books ngram corpus. In Proceedings of the GEMS 2011\nWorkshop on GEometrical Models of Natural Language Semantics, pages 67\u201371, 2011.\n\n[9] William L Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky. Inducing domain-specific\nsentiment lexicons from unlabeled corpora. In Proceedings of the 2016 Conference on Empirical\nMethods in Natural Language Processing (EMNLP), pages 595\u2013605, 2016.\n\n[10] William L Hamilton, Jure Leskovec, and Dan Jurafsky. Cultural shift or linguistic drift?\ncomparing two computational measures of semantic change. In Proceedings of the 2016\nConference on Empirical Methods in Natural Language Processing (EMNLP), pages 2116\u20132121, 2016.\n\n[11] William L Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings reveal\nstatistical laws of semantic change.
In Proceedings of the 54th Annual Meeting of the Association\nfor Computational Linguistics (ACL), volume 1, pages 1489\u20131501, 2016.\n\n[12] William L Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on\nlarge graphs. In Advances in Neural Information Processing Systems (NIPS), pages 1025\u20131035,\n2017.\n\n[13] Zellig S Harris. Distributional structure. Word, 10(2-3):146\u2013162, 1954.\n\n[14] Adam Jatowt and Kevin Duh. A framework for analyzing semantic change of words across\ntime. In Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on, pages 229\u2013238, 2014.\n\n[15] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning\ndata-driven curriculum for very deep neural networks on corrupted labels. In International\nConference on Machine Learning, pages 2309\u20132318, 2018.\n\n[16] Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis\nof language through neural language models. Proceedings of the 52nd Annual Meeting of the\nAssociation for Computational Linguistics (ACL), page 61, 2014.\n\n[17] Andrey Kutuzov, Erik Velldal, and Lilja \u00d8vrelid. Temporal dynamics of semantic relations in\nword embeddings: an application to predicting armed conflict participants. In Proceedings of\nthe 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages\n1824\u20131829, 2017.\n\n[18] Andrey Kutuzov, Erik Velldal, and Lilja \u00d8vrelid. Tracing armed conflicts with diachronic word\nembedding models. In Proceedings of the Events and Stories in the News Workshop, Association\nfor Computational Linguistics (ACL), pages 31\u201336, 2017.\n\n[19] Jey Han Lau, Paul Cook, Diana McCarthy, David Newman, and Timothy Baldwin. Word sense\ninduction for novel sense detection.
In Proceedings of the 13th Conference of the European\nchapter of the Association for Computational Linguistics (EACL), pages 591\u2013601, 2012.\n\n[20] Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and\nSlav Petrov. Syntactic annotations for the google books ngram corpus. In Proceedings of the\nAssociation for Computational Linguistics (ACL) 2012 System Demonstrations, pages 169\u2013174,\n2012.\n\n[21] George Marsaglia et al. Choosing a point from the surface of a sphere. The Annals of\nMathematical Statistics, 43(2):645\u2013646, 1972.\n\n[22] Stuck In The Matrix. Reddit public comments (2007-10 through 2015-05). 2015.\nURL https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/.\n\n[23] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages\nfor machine translation. CoRR, abs/1309.4168, 2013. URL http://arxiv.org/abs/1309.4168.\n\n[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed\nrepresentations of words and phrases and their compositionality. In Advances in Neural Information\nProcessing Systems (NIPS), pages 3111\u20133119, 2013.\n\n[25] Sunny Mitra, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan\nGoyal. That\u2019s sick dude!: Automatic identification of word sense change across different\ntimescales. In Proceedings of the 52nd Annual Meeting of the Association for Computational\nLinguistics (ACL), pages 1020\u20131029, 2014.\n\n[26] Mark Pagel, Quentin D Atkinson, and Andrew Meade. Frequency of word-use predicts rates of\nlexical evolution throughout indo-european history. Nature, 449(7163):717, 2007.\n\n[27] Eitan Adam Pechenick, Christopher M Danforth, and Peter Sheridan Dodds.
Characterizing\nthe google books corpus: Strong limits to inferences of socio-cultural and linguistic evolution.\nPloS one, 10(10):e0137041, 2015.\n\n[28] Florencia Reali and Thomas L Griffiths. Words as alleles: connecting language evolution with\nbayesian learners to models of genetic drift. Proceedings of the Royal Society of London B:\nBiological Sciences, 277(1680):429\u2013436, 2010.\n\n[29] Walter Rudin. Real and Complex Analysis, 3rd Ed. McGraw-Hill, Inc., New York, NY, USA,\n1987. ISBN 0070542341.\n\n[30] Maja R Rudolph and David M. Blei. Dynamic embeddings for language evolution. In\nProceedings of the 2018 International Conference on World Wide Web (WWW), pages 1003\u20131011,\n2018.\n\n[31] Eyal Sagi, Stefan Kaufmann, and Brady Clark. Tracing semantic change with latent semantic\nanalysis. Current methods in historical semantics, pages 161\u2013183.\n\n[32] Peter H Sch\u00f6nemann. A generalized solution of the orthogonal procrustes problem.\nPsychometrika, 31(1):1\u201310, 1966.\n\n[33] Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. Offline bilingual\nword vectors, orthogonal transformations and the inverted softmax. In International Conference\non Learning Representations (ICLR), 2017.\n\n[34] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395\u2013416,\n2007.\n\n[35] Yang Xu and Charles Kemp. A computational evaluation of two laws of semantic change. In\nCognitive Science Society (CogSci), 2015.\n\n[36] Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. Dynamic word embeddings\nfor evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference\non Web Search and Data Mining (WSDM), pages 673\u2013681, 2018.\n\n[37] Zi Yin. Pairwise inner product distance: Metric for functionality, stability, dimensionality of\nvector embedding. arXiv preprint arXiv:1803.00502, 2018.\n\n[38] Zi Yin and Yuanyuan Shen.
On the dimensionality of word embedding. In Advances in Neural\nInformation Processing Systems 31, pages 895\u2013906. 2018.\n\n[39] Xiaoqin Zhang, Di Wang, Zhengyuan Zhou, and Yi Ma. Simultaneous rectification and\nalignment via robust recovery of low-rank tensors. In Advances in Neural Information Processing\nSystems, pages 1637\u20131645, 2013.\n\n[40] Xiaoqin Zhang, Zhengyuan Zhou, Di Wang, and Yi Ma. Hybrid singular value thresholding\nfor tensor completion. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial\nIntelligence, pages 1362\u20131368. AAAI Press, 2014.\n\n[41] Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning:\nGeneralization and optimization. arXiv preprint arXiv:1810.04778, 2018.", "award": [], "sourceid": 5728, "authors": [{"given_name": "Zi", "family_name": "Yin", "institution": "Stanford University"}, {"given_name": "Vin", "family_name": "Sachidananda", "institution": "Stanford University"}, {"given_name": "Balaji", "family_name": "Prabhakar", "institution": "Stanford University"}]}