{"title": "On the Dimensionality of Word Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 887, "page_last": 898, "abstract": "In this paper, we provide a theoretical understanding of word embedding and its dimensionality. Motivated by the unitary-invariance of word embedding, we propose the Pairwise Inner Product (PIP) loss, a novel metric on the dissimilarity between word embeddings. Using techniques from matrix perturbation theory, we reveal a fundamental bias-variance trade-off in dimensionality selection for word embeddings. This bias-variance trade-off sheds light on many empirical observations which were previously unexplained, for example the existence of an optimal dimensionality. Moreover, new insights and discoveries, like when and how word embeddings are robust to over-fitting, are revealed. By optimizing over the bias-variance trade-off of the PIP loss, we can explicitly answer the open question of dimensionality selection for word embedding.", "full_text": "On the Dimensionality of Word Embedding\n\nZi Yin\n\nStanford University\n\nYuanyuan Shen\n\nMicrosoft Corp. & Stanford University\n\ns0960974@gmail.com\n\nYuanyuan.Shen@microsoft.com\n\nAbstract\n\nIn this paper, we provide a theoretical understanding of word embedding and its\ndimensionality. Motivated by the unitary-invariance of word embedding, we pro-\npose the Pairwise Inner Product (PIP) loss, a novel metric on the dissimilarity\nbetween word embeddings. Using techniques from matrix perturbation theory, we\nreveal a fundamental bias-variance trade-off in dimensionality selection for word\nembeddings. This bias-variance trade-off sheds light on many empirical observa-\ntions which were previously unexplained, for example the existence of an optimal\ndimensionality. Moreover, new insights and discoveries, like when and how word\nembeddings are robust to over-\ufb01tting, are revealed. 
By optimizing over the bias-variance trade-off of the PIP loss, we can explicitly answer the open question of dimensionality selection for word embedding.\n\n1 Introduction\n\nWord embeddings are very useful and versatile tools, serving as keys to many fundamental problems in NLP research [Turney and Pantel, 2010]. To name a few, word embeddings are widely applied in information retrieval [Salton, 1971, Salton and Buckley, 1988, Sparck Jones, 1972], recommendation systems [Breese et al., 1998, Yin et al., 2017], image description [Frome et al., 2013], relation discovery [Mikolov et al., 2013c] and word-level translation [Mikolov et al., 2013b]. Furthermore, numerous important applications are built on top of word embeddings. Some prominent examples are long short-term memory (LSTM) networks [Hochreiter and Schmidhuber, 1997] that are used for language modeling [Bengio et al., 2003], machine translation [Sutskever et al., 2014, Bahdanau et al., 2014], text summarization [Nallapati et al., 2016] and image caption generation [Xu et al., 2015, Vinyals et al., 2015]. Other important applications include named entity recognition [Lample et al., 2016], sentiment analysis [Socher et al., 2013] and so on.\n\nHowever, the impact of dimensionality on word embedding has not yet been fully understood. As a critical hyper-parameter, the choice of dimensionality for word vectors has a huge influence on the performance of a word embedding. First, it directly impacts the quality of word vectors: a word embedding with a small dimensionality is typically not expressive enough to capture all possible word relations, whereas one with a very large dimensionality suffers from over-fitting. Second, the number of parameters for a word embedding or a model that builds on word embeddings (e.g. recurrent neural networks) is usually a linear or quadratic function of dimensionality, which directly affects training time and computational costs. 
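To make the parameter-count argument concrete, here is a back-of-the-envelope sketch; the vocabulary size, dimensionality, and LSTM gate structure below are illustrative assumptions, not numbers from this paper:

```python
# Back-of-the-envelope parameter counts (illustrative sizes only).
# The embedding table grows linearly in the dimensionality d, while an
# LSTM stacked on top of the embeddings grows quadratically in d when
# its hidden size is tied to d.

def embedding_params(n_vocab: int, d: int) -> int:
    # one d-dimensional vector per vocabulary word
    return n_vocab * d

def lstm_params(d: int, h: int) -> int:
    # 4 gates, each with input weights (d x h), recurrent weights (h x h),
    # and a bias vector of size h
    return 4 * (d * h + h * h + h)

n_vocab, d = 10_000, 300
total = embedding_params(n_vocab, d) + lstm_params(d, d)
```

Doubling d in this sketch roughly doubles the embedding table but quadruples the LSTM weights, which is the cost argument made above.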
Therefore, large dimensionalities tend to increase model complexity, slow down training speed, and add inferential latency, all of which are constraints that can potentially limit model applicability and deployment [Wu et al., 2016].\n\nDimensionality selection for embedding is a well-known open problem. In most NLP research, dimensionality is either selected ad hoc or by grid search, either of which can lead to sub-optimal model performance. For example, 300 is perhaps the most commonly used dimensionality in various studies [Mikolov et al., 2013a, Pennington et al., 2014, Bojanowski et al., 2017]. This is possibly due to the influence of the groundbreaking paper which introduced the skip-gram Word2Vec model and chose a dimensionality of 300 [Mikolov et al., 2013a].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nA better empirical approach used by some researchers is to first train many embeddings of different dimensionalities, evaluate them on a functionality test (like word relatedness or word analogy), and then pick the one with the best empirical performance. However, this method suffers from 1) greatly increased time complexity and computational burden, 2) inability to exhaust all possible dimensionalities and 3) lack of consensus between different functionality tests, as their results can differ. Thus, we need a universal criterion that can reflect the relationship between the dimensionality and quality of word embeddings in order to establish a dimensionality selection procedure for embedding methods.\n\nIn this regard, we outline a few major contributions of our paper:\n\n1. We introduce the PIP loss, a novel metric on the dissimilarity between word embeddings;\n\n2. We develop a mathematical framework that reveals a fundamental bias-variance trade-off in dimensionality selection. 
We explain the existence of an optimal dimensionality, a phenomenon commonly observed but previously unexplained;\n\n3. We quantify the robustness of embedding algorithms using the exponent parameter α, and establish that many widely used embedding algorithms, including skip-gram and GloVe, are robust to over-fitting;\n\n4. We propose a mathematically rigorous answer to the open problem of dimensionality selection by minimizing the PIP loss. We perform this procedure and cross-validate the results with grid search for LSA, skip-gram Word2Vec and GloVe on an English corpus.\n\nFor the rest of the paper, we consider the problem of learning an embedding for a vocabulary of size n, which is canonically defined as V = {1, 2, ..., n}. Specifically, we want to learn a vector representation v_i ∈ R^d for each token i. The main object is the embedding matrix E ∈ R^{n×d}, consisting of the stacked vectors v_i, where E_{i,·} = v_i. All matrix norms in the paper are Frobenius norms unless otherwise stated.\n\n2 Preliminaries and Background Knowledge\n\nOur framework is built on the following preliminaries:\n\n1. Word embeddings are unitary-invariant;\n\n2. Most existing word embedding algorithms can be formulated as low-rank matrix approximations, either explicitly or implicitly.\n\n2.1 Unitary Invariance of Word Embeddings\n\nThe unitary-invariance of word embeddings has been discovered in recent research [Hamilton et al., 2016, Artetxe et al., 2016, Smith et al., 2017, Yin, 2018]. It states that two embeddings are essentially identical if one can be obtained from the other by performing a unitary operation, e.g., a rotation. A unitary operation on a vector corresponds to multiplying the vector by a unitary matrix, i.e. v′ = vU, where U^T U = U U^T = I_d. Note that a unitary transformation preserves the relative geometry of the vectors, and hence defines an equivalence class of embeddings. 
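This equivalence is easy to verify numerically; the following is a minimal sketch (sizes chosen arbitrarily) checking that an orthogonal rotation leaves all pairwise inner products unchanged:

```python
import numpy as np

# A unitary (here: real orthogonal) transformation preserves the relative
# geometry of an embedding: E U has exactly the same pairwise inner
# products as E, since (E U)(E U)^T = E U U^T E^T = E E^T.
rng = np.random.default_rng(0)
n, d = 50, 10
E = rng.standard_normal((n, d))

# a random orthogonal matrix via QR decomposition
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
E_rot = E @ U

assert np.allclose(U.T @ U, np.eye(d))          # U is orthogonal
assert np.allclose(E_rot @ E_rot.T, E @ E.T)    # relative geometry preserved
```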
In Section 3, we introduce the Pairwise Inner Product loss, a unitary-invariant metric on embedding similarity.\n\n2.2 Word Embeddings from Explicit Matrix Factorization\n\nA wide range of embedding algorithms use explicit matrix factorization, including the popular Latent Semantic Analysis (LSA). In LSA, word embeddings are obtained by truncated SVD of a signal matrix M which is usually based on co-occurrence statistics, for example the Pointwise Mutual Information (PMI) matrix, positive PMI (PPMI) matrix and Shifted PPMI (SPPMI) matrix [Levy and Goldberg, 2014]. Eigenwords [Dhillon et al., 2015] is another example of this type.\n\nCaron [2001], Bullinaria and Levy [2012], Turney [2012], Levy and Goldberg [2014] described a generic approach of obtaining embeddings from matrix factorization. Let M be the signal matrix (e.g. the PMI matrix) and M = U D V^T be its SVD. A k-dimensional embedding is obtained by truncating the left singular matrix U at dimension k, and multiplying it by a power of the truncated diagonal matrix D, i.e. E = U_{·,1:k} D^α_{1:k,1:k} for some α ∈ [0, 1]. Caron [2001], Bullinaria and Levy [2012] discovered through empirical studies that different values of α work better for different language tasks. In Levy and Goldberg [2014], where the authors explained the connection between skip-gram Word2Vec and matrix factorization, α is set to 0.5 to enforce symmetry. We discover that α controls the robustness of embeddings against over-fitting, as will be discussed in Section 5.1.\n\n2.3 Word Embeddings from Implicit Matrix Factorization\n\nIn NLP, the two most widely used embedding models are skip-gram Word2Vec [Mikolov et al., 2013c] and GloVe [Pennington et al., 2014]. 
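The generic recipe above (truncate the SVD at dimension k, then scale the directions by D^α) can be sketched in a few lines; the random signal matrix, k, and α below are arbitrary stand-ins, not values prescribed by any particular algorithm:

```python
import numpy as np

# Generic matrix-factorization embedding: E = U_{.,1:k} D^alpha_{1:k,1:k},
# where M = U D V^T is the SVD of a signal matrix (e.g. a PMI matrix).
def embed(M: np.ndarray, k: int, alpha: float) -> np.ndarray:
    U, s, _ = np.linalg.svd(M)
    return U[:, :k] * s[:k] ** alpha   # scale each kept direction

rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
M = (M + M.T) / 2                      # symmetrize, like a PMI matrix
E = embed(M, k=5, alpha=0.5)           # alpha = 0.5 enforces symmetry
assert E.shape == (20, 5)
```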
Although they learn word embeddings by optimizing objective functions with stochastic gradient methods, both have been shown to implicitly perform matrix factorizations.\n\nSkip-gram. Skip-gram Word2Vec maximizes the likelihood of co-occurrence of the center word and its context words. The log likelihood is defined as\n\nΣ_{i=0}^n Σ_{j=i−w, j≠i}^{i+w} log(σ(v_j^T v_i)), where σ(x) = e^x / (1 + e^x)\n\nLevy and Goldberg [2014] showed that skip-gram Word2Vec's objective is an implicit symmetric factorization of the Pointwise Mutual Information (PMI) matrix:\n\nPMI_{ij} = log( p(v_i, v_j) / (p(v_i) p(v_j)) )\n\nSkip-gram is sometimes enhanced with techniques like negative sampling [Mikolov et al., 2013b], in which case the signal matrix becomes the Shifted PMI matrix [Levy and Goldberg, 2014].\n\nGloVe. Levy et al. [2015] pointed out that the objective of GloVe is implicitly a symmetric factorization of the log-count matrix. The factorization is sometimes augmented with bias vectors, and the log-count matrix is sometimes raised to an exponent γ ∈ [0, 1] [Pennington et al., 2014].\n\n3 PIP Loss: a Novel Unitary-invariant Loss Function for Embeddings\n\nHow do we know whether a trained word embedding is good enough? Questions of this kind cannot be answered without a properly defined loss function. For example, in statistical estimation (e.g. linear regression), the quality of an estimator θ̂ can often be measured using the l2 loss E[‖θ̂ − θ*‖₂²], where θ* is the unobserved ground-truth parameter. Similarly, for word embedding, a proper metric is needed in order to evaluate the quality of a trained embedding.\n\nAs discussed in Section 2.1, a reasonable loss function between embeddings should respect the unitary-invariance. 
This rules out choices like direct comparisons, for example using ‖E_1 − E_2‖ as the loss function. We propose the Pairwise Inner Product (PIP) loss, which naturally arises from the unitary-invariance, as the dissimilarity metric between two word embeddings:\n\nDefinition 1 (PIP matrix). Given an embedding matrix E ∈ R^{n×d}, define its associated Pairwise Inner Product (PIP) matrix to be\n\nPIP(E) = E E^T\n\nIt can be seen that the (i, j)-th entry of the PIP matrix corresponds to the inner product between the embeddings for word i and word j, i.e. PIP_{i,j} = ⟨v_i, v_j⟩. To compare E_1 and E_2, two embedding matrices on a common vocabulary, we propose the PIP loss:\n\nDefinition 2 (PIP loss). The PIP loss between E_1 and E_2 is defined as the norm of the difference between their PIP matrices:\n\n‖PIP(E_1) − PIP(E_2)‖ = ‖E_1 E_1^T − E_2 E_2^T‖ = √(Σ_{i,j} (⟨v_i^{(1)}, v_j^{(1)}⟩ − ⟨v_i^{(2)}, v_j^{(2)}⟩)²)\n\nNote that the i-th row of the PIP matrix, v_i E^T = (⟨v_i, v_1⟩, ..., ⟨v_i, v_n⟩), can be viewed as the relative position of v_i anchored against all other vectors {v_1, ..., v_n}. In essence, the PIP loss measures the vectors' relative position shifts between E_1 and E_2, thereby removing their dependencies on any specific coordinate system. The PIP loss respects the unitary-invariance. Specifically, if E_2 = E_1 U where U is a unitary matrix, then the PIP loss between E_1 and E_2 is zero, because E_2 E_2^T = E_1 U U^T E_1^T = E_1 E_1^T. In addition, the PIP loss serves as a metric of functionality dissimilarity. 
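The two definitions above translate directly into code; a minimal sketch follows, with random matrices standing in for trained embeddings:

```python
import numpy as np

# PIP loss between two embeddings over the same vocabulary:
# the Frobenius norm of the difference of their PIP matrices.
def pip_loss(E1: np.ndarray, E2: np.ndarray) -> float:
    return float(np.linalg.norm(E1 @ E1.T - E2 @ E2.T))

rng = np.random.default_rng(2)
E1 = rng.standard_normal((30, 8))
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # random unitary matrix

assert np.isclose(pip_loss(E1, E1 @ U), 0.0)   # unitary-invariance
assert pip_loss(E1, E1[:, :7]) > 0.0           # a dropped dimension is detected
```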
A practitioner may only care about the usability of word embeddings, for example, using them to solve analogy and relatedness tasks [Schnabel et al., 2015, Baroni et al., 2014], which are the two most important properties of word embeddings. Since both properties are tightly related to vector inner products, a small PIP loss between E_1 and E_2 leads to a small difference in E_1 and E_2's relatedness and analogy, as the PIP loss measures the difference in inner products[1]. As a result, from both theoretical and practical standpoints, the PIP loss is a suitable loss function for embeddings. Furthermore, we show in Section 4 that this formulation opens up a new angle to understanding the effect of embedding dimensionality with matrix perturbation theory.\n\n4 How Does Dimensionality Affect the Quality of Embedding?\n\nWith the PIP loss, we can now study the quality of trained word embeddings for any algorithm that uses matrix factorization. Suppose a d-dimensional embedding is derived from a signal matrix M with the form f_{α,d}(M) ≜ U_{·,1:d} D^α_{1:d,1:d}, where M = U D V^T is the SVD. In the ideal scenario, a genie reveals a clean signal matrix M (e.g. the PMI matrix) to the algorithm, which yields the oracle embedding E = f_{α,d}(M). However, in practice, there is no magical oil lamp, and we have to estimate M̃ (e.g. the empirical PMI matrix) from the training data, where M̃ = M + Z is perturbed by the estimation noise Z. The trained embedding Ê = f_{α,k}(M̃) is computed by factorizing this noisy matrix. To ensure Ê is close to E, we want the PIP loss ‖E E^T − Ê Ê^T‖ to be small. In particular, this PIP loss is affected by k, the dimensionality we select for the trained embedding.\n\nArora [2016] discussed in an article a mysterious empirical observation of word embeddings: “... 
A striking finding in empirical work on word embeddings is that there is a sweet spot for the dimensionality of word vectors: neither too small, nor too large”[2]. He proceeded by discussing two possible explanations: low-dimensional projection (like the Johnson-Lindenstrauss Lemma) and standard generalization theory (like the VC dimension), and pointed out why neither is sufficient for explaining this phenomenon. While some may argue that this is caused by underfitting/overfitting, the concept itself is too broad to provide any useful insight. We show that this phenomenon can be explicitly explained by a bias-variance trade-off in Sections 4.1, 4.2 and 4.3. Equipped with the PIP loss, we give a mathematical presentation of the bias-variance trade-off using matrix perturbation theory. We first introduce a classical result in Lemma 1; the proof is deferred to the appendix, and can also be found in Stewart and Sun [1990].\n\nLemma 1. Let X, Y be two orthogonal matrices in R^{n×n}. Write X = [X_0, X_1] and Y = [Y_0, Y_1], where X_0, Y_0 ∈ R^{n×k} are the first k columns of X and Y respectively, with k ≤ n. Then\n\n‖X_0 X_0^T − Y_0 Y_0^T‖ = c ‖X_0^T Y_1‖\n\nwhere c is a constant depending on the norm only: c = 1 for the 2-norm and √2 for the Frobenius norm.\n\nAs pointed out by several papers [Caron, 2001, Bullinaria and Levy, 2012, Turney, 2012, Levy and Goldberg, 2014], embedding algorithms can be generically characterized as E = U_{·,1:k} D^α_{1:k,1:k} for some α ∈ [0, 1]. For illustration purposes, we first consider a special case where α = 0.\n\n4.1 The Bias-Variance Trade-off for a Special Case: α = 0\n\nThe following theorem shows how the PIP loss can be naturally decomposed into a bias term and a variance term when α = 0:\n\nTheorem 1. 
Let E ∈ R^{n×d} and Ê ∈ R^{n×k} be the oracle and trained embeddings, where k ≤ d. Assume both have orthonormal columns. Then the PIP loss has a bias-variance decomposition\n\n‖PIP(E) − PIP(Ê)‖² = d − k + 2‖Ê^T E_⊥‖²\n\nProof. The proof utilizes techniques from matrix perturbation theory. To simplify notations, denote X_0 = E, Y_0 = Ê, and let X = [X_0, X_1], Y = [Y_0, Y_1] be the completed n-by-n orthogonal matrices. Since k ≤ d, we can further split X_0 into X_{0,1} and X_{0,2}, where the former has k columns and the latter d − k. Now, the PIP loss equals\n\n‖E E^T − Ê Ê^T‖² = ‖X_{0,1} X_{0,1}^T + X_{0,2} X_{0,2}^T − Y_0 Y_0^T‖²\n= ‖X_{0,1} X_{0,1}^T − Y_0 Y_0^T‖² + ‖X_{0,2} X_{0,2}^T‖² + 2⟨X_{0,1} X_{0,1}^T − Y_0 Y_0^T, X_{0,2} X_{0,2}^T⟩\n=(a) 2‖Y_0^T [X_{0,2}, X_1]‖² + d − k − 2⟨Y_0 Y_0^T, X_{0,2} X_{0,2}^T⟩\n= 2‖Y_0^T X_{0,2}‖² + 2‖Y_0^T X_1‖² + d − k − 2⟨Y_0 Y_0^T, X_{0,2} X_{0,2}^T⟩\n= d − k + 2‖Y_0^T X_1‖² = d − k + 2‖Ê^T E_⊥‖²\n\nwhere in equality (a) we used Lemma 1.\n\n[1] A detailed discussion on the PIP loss and analogy/relatedness is deferred to the appendix.\n[2] http://www.offconvex.org/2016/02/14/word-embeddings-2/\n\nThe observation is that the right-hand side now consists of two parts, which we identify as bias and variance. The first part, d − k, is the amount of lost signal, caused by discarding the last d − k dimensions when selecting k ≤ d. However, ‖Ê^T E_⊥‖ increases as k increases, as the noise perturbs the subspace spanned by E, and the singular vectors corresponding to smaller singular values are more prone to such perturbation. 
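Because Theorem 1 is an exact identity, it can be checked numerically; below is a small sketch (dimensions arbitrary) where the oracle and trained embeddings are random matrices with orthonormal columns:

```python
import numpy as np

# Check the alpha = 0 decomposition: for E (n x d) and E_hat (n x k) with
# orthonormal columns and k <= d,
#   ||E E^T - E_hat E_hat^T||_F^2 = (d - k) + 2 ||E_hat^T E_perp||_F^2,
# where E_perp spans the orthogonal complement of E's columns.
rng = np.random.default_rng(3)
n, d, k = 40, 12, 7

X, _ = np.linalg.qr(rng.standard_normal((n, n)))   # full orthogonal basis
E, E_perp = X[:, :d], X[:, d:]
E_hat, _ = np.linalg.qr(rng.standard_normal((n, k)))

lhs = np.linalg.norm(E @ E.T - E_hat @ E_hat.T) ** 2
rhs = (d - k) + 2 * np.linalg.norm(E_hat.T @ E_perp) ** 2
assert np.isclose(lhs, rhs)
```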
As a result, the optimal dimensionality k* which minimizes the PIP loss lies in between 0 and d, the rank of the matrix M.\n\n4.2 The Bias-Variance Trade-off for the Generic Case: α ∈ (0, 1]\n\nIn this generic case, the columns of E, Ê are no longer orthonormal, so the assumptions of classical matrix perturbation theory are not satisfied. We develop a novel technique where Lemma 1 is applied in a telescoping fashion. The proof of the theorem is deferred to the appendix.\n\nTheorem 2. Let M = U D V^T, M̃ = Ũ D̃ Ṽ^T be the SVDs of the clean and estimated signal matrices. Suppose E = U_{·,1:d} D^α_{1:d,1:d} is the oracle embedding and Ê = Ũ_{·,1:k} D̃^α_{1:k,1:k} is the trained embedding, for some k ≤ d. Let D = diag(λ_i) and D̃ = diag(λ̃_i); then\n\n‖PIP(E) − PIP(Ê)‖ ≤ √(Σ_{i=k+1}^d λ_i^{4α}) + √(Σ_{i=1}^k (λ_i^{2α} − λ̃_i^{2α})²) + √2 Σ_{i=1}^k (λ_i^{2α} − λ_{i+1}^{2α}) ‖Ũ_{·,1:i}^T U_{·,i:n}‖\n\nAs before, the three terms in Theorem 2 can be characterized as bias and variance. The first term is the bias, as we lose part of the signal by choosing k ≤ d. Notice that the embedding matrix E consists of signal directions (given by U) and their magnitudes (given by D^α). The second term is the variance on the magnitudes, and the third term is the variance on the directions.\n\n4.3 The Bias-Variance Trade-off Captures the Signal-to-Noise Ratio\n\nWe now present the main theorem, which shows that the bias-variance trade-off reflects the “signal-to-noise ratio” in dimensionality selection.\n\nTheorem 3 (Main theorem). Suppose M̃ = M + Z, where M is the signal matrix, symmetric with spectrum {λ_i}_{i=1}^d. 
Z is the estimation noise, symmetric with iid, zero mean, variance σ² entries. For any 0 ≤ α ≤ 1 and k ≤ d, let the oracle and trained embeddings be\n\nE = U_{·,1:d} D^α_{1:d,1:d},  Ê = Ũ_{·,1:k} D̃^α_{1:k,1:k}\n\nwhere M = U D V^T, M̃ = Ũ D̃ Ṽ^T are the SVDs of the clean and estimated signal matrices. Then\n\n1. When α = 0,\n\nE[‖E E^T − Ê Ê^T‖] ≤ √(d − k + 2σ² Σ_{r≤k, s>d} (λ_r − λ_s)^{−2})\n\n2. When 0 < α ≤ 1,\n\nE[‖E E^T − Ê Ê^T‖] ≤ √(Σ_{i=k+1}^d λ_i^{4α}) + 2√(2n) ασ √(Σ_{i=1}^k λ_i^{4α−2}) + √2 σ Σ_{i=1}^k (λ_i^{2α} − λ_{i+1}^{2α}) √(Σ_{r≤i<s≤n} (λ_r − λ_s)^{−2})\n\nWriting U = [U_0, U_1] and Ũ = [Ũ_0, Ũ_1] with U_0, Ũ_0 ∈ R^{n×k}, suppose the spectral gap λ_k − λ_{k+1} > 0, and Z has iid, zero mean entries with variance σ²; then\n\nE[‖Ũ_1^T U_0‖] ≤ σ √(Σ_{1≤i≤k<j≤n} (λ_i − λ_j)^{−2}