{"title": "Neural Word Embedding as Implicit Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 2177, "page_last": 2185, "abstract": "We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS's solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.", "full_text": "Neural Word Embedding\n\nas Implicit Matrix Factorization\n\nOmer Levy\n\nYoav Goldberg\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nBar-Ilan University\n\nomerlevy@gmail.com\n\nBar-Ilan University\n\nyoav.goldberg@gmail.com\n\nAbstract\n\nWe analyze skip-gram with negative-sampling (SGNS), a word embedding\nmethod introduced by Mikolov et al., and show that it is implicitly factorizing\na word-context matrix, whose cells are the pointwise mutual information (PMI) of\nthe respective word and context pairs, shifted by a global constant. 
We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS's solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.\n\n1 Introduction\n\nMost tasks in natural language processing and understanding involve looking at words, and could benefit from word representations that do not treat individual words as unique symbols, but instead reflect similarities and dissimilarities between them. The common paradigm for deriving such representations is based on the distributional hypothesis of Harris [15], which states that words in similar contexts have similar meanings. This has given rise to many word representation methods in the NLP literature, the vast majority of which can be described in terms of a word-context matrix M in which each row i corresponds to a word, each column j to a context in which the word appeared, and each matrix entry M_ij corresponds to some association measure between the word and the context. Words are then represented as rows in M or in a dimensionality-reduced matrix based on M.\n\nRecently, there has been a surge of work proposing to represent words as dense vectors, derived using various training methods inspired by neural-network language modeling [3, 9, 23, 21]. These representations, referred to as \u201cneural embeddings\u201d or \u201cword embeddings\u201d, have been shown to perform well in a variety of NLP tasks [26, 10, 1]. 
In particular, a sequence of papers by Mikolov and colleagues [20, 21] culminated in the skip-gram with negative-sampling (SGNS) training method, which is both efficient to train and provides state-of-the-art results on various linguistic tasks. The training method (as implemented in the word2vec software package) is highly popular, but not well understood. While it is clear that the training objective follows the distributional hypothesis - by trying to maximize the dot-product between the vectors of frequently occurring word-context pairs, and minimize it for random word-context pairs - very little is known about the quantity being optimized by the algorithm, or the reason it is expected to produce good word representations.\n\nIn this work, we aim to broaden the theoretical understanding of neural-inspired word embeddings. Specifically, we cast SGNS's training method as weighted matrix factorization, and show that its objective is implicitly factorizing a shifted PMI matrix - the well-known word-context PMI matrix from the word-similarity literature, shifted by a constant offset. A similar result holds also for the NCE embedding method of Mnih and Kavukcuoglu [24]. While it is impractical to directly use the very high-dimensional and dense shifted PMI matrix, we propose to approximate it with the positive shifted PMI matrix (Shifted PPMI), which is sparse. Shifted PPMI is far better at optimizing SGNS's objective, and performs slightly better than word2vec-derived vectors on several linguistic tasks. Finally, we suggest a simple spectral algorithm that is based on performing SVD over the Shifted PPMI matrix. The spectral algorithm outperforms both SGNS and the Shifted PPMI matrix on the word similarity tasks, and is scalable to large corpora. However, it lags behind the SGNS-derived representation on word-analogy tasks. 
We conjecture that this behavior is related to the fact that SGNS performs weighted matrix factorization, giving more influence to frequent pairs, as opposed to SVD, which gives the same weight to all matrix cells. While the weighted and non-weighted objectives share the same optimal solution (perfect reconstruction of the shifted PMI matrix), they result in different generalizations when combined with dimensionality constraints.\n\n2 Background: Skip-Gram with Negative Sampling (SGNS)\n\nOur departure point is SGNS - the skip-gram neural embedding model introduced in [20] trained using the negative-sampling procedure presented in [21]. In what follows, we summarize the SGNS model and introduce our notation. A detailed derivation of the SGNS model is available in [14].\n\nSetting and Notation. The skip-gram model assumes a corpus of words w \u2208 V_W and their contexts c \u2208 V_C, where V_W and V_C are the word and context vocabularies. In [20, 21] the words come from an unannotated textual corpus of words w_1, w_2, ..., w_n (typically n is in the billions) and the contexts for word w_i are the words surrounding it in an L-sized window w_{i-L}, ..., w_{i-1}, w_{i+1}, ..., w_{i+L}. Other definitions of contexts are possible [18]. We denote the collection of observed word and context pairs as D. We use #(w, c) to denote the number of times the pair (w, c) appears in D. Similarly, #(w) = \u03a3_{c' \u2208 V_C} #(w, c') and #(c) = \u03a3_{w' \u2208 V_W} #(w', c) are the number of times w and c occurred in D, respectively.\n\nEach word w \u2208 V_W is associated with a vector w\u20d7 \u2208 R^d and similarly each context c \u2208 V_C is represented as a vector c\u20d7 \u2208 R^d, where d is the embedding's dimensionality. The entries in the vectors are latent, and treated as parameters to be learned. We sometimes refer to the vectors w\u20d7 as rows in a |V_W| \u00d7 d matrix W, and to the vectors c\u20d7 as rows in a |V_C| \u00d7 d matrix C. In such cases, W_i (C_i) refers to the vector representation of the ith word (context) in the corresponding vocabulary. When referring to embeddings produced by a specific method x, we will usually use W^x and C^x explicitly, but may use just W and C when the method is clear from the discussion.\n\nSGNS's Objective. Consider a word-context pair (w, c). Did this pair come from the observed data D? Let P(D = 1|w, c) be the probability that (w, c) came from the data, and P(D = 0|w, c) = 1 - P(D = 1|w, c) the probability that (w, c) did not. The distribution is modeled as:\n\nP(D = 1|w, c) = \u03c3(w\u20d7 \u00b7 c\u20d7) = 1 / (1 + e^{-w\u20d7 \u00b7 c\u20d7})\n\nwhere w\u20d7 and c\u20d7 (each a d-dimensional vector) are the model parameters to be learned. The negative sampling objective tries to maximize P(D = 1|w, c) for observed (w, c) pairs while maximizing P(D = 0|w, c) for randomly sampled \u201cnegative\u201d examples (hence the name \u201cnegative sampling\u201d), under the assumption that randomly selecting a context for a given word is likely to result in an unobserved (w, c) pair. SGNS's objective for a single (w, c) observation is then:\n\nlog \u03c3(w\u20d7 \u00b7 c\u20d7) + k \u00b7 E_{c_N \u223c P_D}[log \u03c3(-w\u20d7 \u00b7 c\u20d7_N)]   (1)\n\nwhere k is the number of \u201cnegative\u201d samples and c_N is the sampled context, drawn according to the empirical unigram distribution P_D(c) = #(c) / |D|.^1\n\n^1 In the algorithm described in [21], the negative contexts are sampled according to p_{3/4}(c) = #(c)^{3/4} / Z instead of the unigram distribution #(c) / Z. Sampling according to p_{3/4} indeed produces somewhat superior results on some of the semantic evaluation tasks. It is straightforward to modify the PMI metric in a similar fashion by replacing the p(c) term with p_{3/4}(c), and doing so shows similar trends in the matrix-based methods as it does in word2vec's stochastic-gradient-based training method. We do not explore this further in this paper, and report results using the unigram distribution.\n\nThe objective is trained in an online fashion using stochastic gradient updates over the observed pairs in the corpus D. The global objective then sums over the observed (w, c) pairs in the corpus:\n\n\u2113 = \u03a3_{w \u2208 V_W} \u03a3_{c \u2208 V_C} #(w, c) (log \u03c3(w\u20d7 \u00b7 c\u20d7) + k \u00b7 E_{c_N \u223c P_D}[log \u03c3(-w\u20d7 \u00b7 c\u20d7_N)])   (2)\n\nOptimizing this objective makes observed word-context pairs have similar embeddings, while scattering unobserved pairs. Intuitively, words that appear in similar contexts should have similar embeddings, though we are not familiar with a formal proof that SGNS does indeed maximize the dot-product of similar words.\n\n3 SGNS as Implicit Matrix Factorization\n\nSGNS embeds both words and their contexts into a low-dimensional space R^d, resulting in the word and context matrices W and C. The rows of matrix W are typically used in NLP tasks (such as computing word similarities) while C is ignored. It is nonetheless instructive to consider the product W \u00b7 C^T = M. Viewed this way, SGNS can be described as factorizing an implicit matrix M of dimensions |V_W| \u00d7 |V_C| into two smaller matrices.\n\nWhich matrix is being factorized? A matrix entry M_ij corresponds to the dot product W_i \u00b7 C_j = w\u20d7_i \u00b7 c\u20d7_j. Thus, SGNS is factorizing a matrix in which each row corresponds to a word w \u2208 V_W, each column corresponds to a context c \u2208 V_C, and each cell contains a quantity f(w, c) reflecting the strength of association between that particular word-context pair. 
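As a concrete illustration of the global objective, the sum in equation 2 can be evaluated directly from corpus counts once the expectation over negative contexts is expanded using the unigram distribution. The sketch below is ours, not from the paper; it treats the dot products as a plain lookup table on a toy pair collection:

```python
import math
from collections import Counter

def sgns_global_objective(pairs, dot, k=1):
    """Global SGNS objective (eq. 2) computed from corpus counts.

    pairs: list of observed (word, context) tuples (the collection D)
    dot:   dict mapping (word, context) -> the dot product of their vectors;
           pairs not listed are taken to have dot product 0
    k:     number of negative samples
    """
    D = len(pairs)
    n_wc = Counter(pairs)               # #(w, c)
    n_w = Counter(w for w, _ in pairs)  # #(w)
    n_c = Counter(c for _, c in pairs)  # #(c)
    sigma = lambda x: 1.0 / (1.0 + math.exp(-x))
    total = 0.0
    # positive term: sum over observed pairs of #(w,c) * log sigma(w.c)
    for (w, c), n in n_wc.items():
        total += n * math.log(sigma(dot.get((w, c), 0.0)))
    # negative term: the expectation expanded over all contexts,
    # weighted by #(w) and the unigram probability #(c)/|D|
    for w in n_w:
        for c in n_c:
            total += k * n_w[w] * (n_c[c] / D) * math.log(sigma(-dot.get((w, c), 0.0)))
    return total
```

With all dot products at zero, every sigmoid is 0.5, so the objective collapses to (1 + k) · |D| · log 0.5, a handy sanity check.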
Such word-context association matrices are very common in the NLP and word-similarity literature, see e.g. [29, 2]. That said, the objective of SGNS (equation 2) does not explicitly state what this association metric is. What can we say about the association function f(w, c)? In other words, which matrix is SGNS factorizing?\n\n3.1 Characterizing the Implicit Matrix\n\nConsider the global objective (equation 2) above. For sufficiently large dimensionality d (i.e. allowing for a perfect reconstruction of M), each product w\u20d7 \u00b7 c\u20d7 can assume a value independently of the others. Under these conditions, we can treat the objective \u2113 as a function of independent w\u20d7 \u00b7 c\u20d7 terms, and find the values of these terms that maximize it.\n\nWe begin by rewriting equation 2:\n\n\u2113 = \u03a3_{w \u2208 V_W} \u03a3_{c \u2208 V_C} #(w, c) log \u03c3(w\u20d7 \u00b7 c\u20d7) + \u03a3_{w \u2208 V_W} \u03a3_{c \u2208 V_C} #(w, c) (k \u00b7 E_{c_N \u223c P_D}[log \u03c3(-w\u20d7 \u00b7 c\u20d7_N)])\n= \u03a3_{w \u2208 V_W} \u03a3_{c \u2208 V_C} #(w, c) log \u03c3(w\u20d7 \u00b7 c\u20d7) + \u03a3_{w \u2208 V_W} #(w) (k \u00b7 E_{c_N \u223c P_D}[log \u03c3(-w\u20d7 \u00b7 c\u20d7_N)])   (3)\n\nand explicitly expressing the expectation term:\n\nE_{c_N \u223c P_D}[log \u03c3(-w\u20d7 \u00b7 c\u20d7_N)] = \u03a3_{c_N \u2208 V_C} (#(c_N) / |D|) log \u03c3(-w\u20d7 \u00b7 c\u20d7_N)\n= (#(c) / |D|) log \u03c3(-w\u20d7 \u00b7 c\u20d7) + \u03a3_{c_N \u2208 V_C \u2216 {c}} (#(c_N) / |D|) log \u03c3(-w\u20d7 \u00b7 c\u20d7_N)   (4)\n\nCombining equations 3 and 4 reveals the local objective for a specific (w, c) pair:\n\n\u2113(w, c) = #(w, c) log \u03c3(w\u20d7 \u00b7 c\u20d7) + k \u00b7 #(w) \u00b7 (#(c) / |D|) \u00b7 log \u03c3(-w\u20d7 \u00b7 c\u20d7)   (5)\n\nTo optimize the objective, we define x = w\u20d7 \u00b7 c\u20d7 and find its partial derivative with respect to x:\n\n\u2202\u2113/\u2202x = #(w, c) \u00b7 \u03c3(-x) - k \u00b7 #(w) \u00b7 (#(c) / |D|) \u00b7 \u03c3(x)\n\nWe compare the derivative to zero, and after some simplification, arrive at:\n\ne^{2x} - ( #(w, c) / (k \u00b7 #(w) \u00b7 #(c) / |D|) - 1 ) e^x - #(w, c) / (k \u00b7 #(w) \u00b7 #(c) / |D|) = 0\n\nIf we define y = e^x, this equation becomes a quadratic equation of y, which has two solutions, y = -1 (which is invalid given the definition of y) and:\n\ny = #(w, c) / (k \u00b7 #(w) \u00b7 #(c) / |D|) = ((#(w, c) \u00b7 |D|) / (#(w) \u00b7 #(c))) \u00b7 (1/k)\n\nSubstituting y with e^x and x with w\u20d7 \u00b7 c\u20d7 reveals:\n\nw\u20d7 \u00b7 c\u20d7 = log( ((#(w, c) \u00b7 |D|) / (#(w) \u00b7 #(c))) \u00b7 (1/k) ) = log( (#(w, c) \u00b7 |D|) / (#(w) \u00b7 #(c)) ) - log k   (6)\n\nInterestingly, the expression log( (#(w, c) \u00b7 |D|) / (#(w) \u00b7 #(c)) ) is the well-known pointwise mutual information (PMI) of (w, c), which we discuss in depth below.\n\nFinally, we can describe the matrix M that SGNS is factorizing:\n\nM^SGNS_ij = W_i \u00b7 C_j = w\u20d7_i \u00b7 c\u20d7_j = PMI(w_i, c_j) - log k   (7)\n\nFor a negative-sampling value of k = 1, the SGNS objective is factorizing a word-context matrix in which the association between a word and its context is measured by f(w, c) = PMI(w, c). We refer to this matrix as the PMI matrix, M^PMI. For negative-sampling values k > 1, SGNS is factorizing a shifted PMI matrix M^{PMI_k} = M^PMI - log k.\n\nOther embedding methods can also be cast as factorizing implicit word-context matrices. 
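The closed-form optimum in equation 6 is easy to check numerically: for any toy counts, a grid search over x = w·c finds that the local objective of equation 5 peaks at PMI(w, c) − log k. A small verification sketch of ours (the counts are made up):

```python
import math

def local_objective(x, n_wc, n_w, n_c, D, k):
    """Pair-specific SGNS objective (eq. 5) as a function of x = w.c."""
    sigma = lambda t: 1.0 / (1.0 + math.exp(-t))
    return n_wc * math.log(sigma(x)) + k * n_w * (n_c / D) * math.log(sigma(-x))

# toy counts: #(w,c)=10, #(w)=100, #(c)=50, |D|=10000, k=5
n_wc, n_w, n_c, D, k = 10, 100, 50, 10000, 5
pmi = math.log(n_wc * D / (n_w * n_c))
x_star = pmi - math.log(k)  # eq. 6: the predicted optimum, here log 4

# a dense grid search over x confirms the analytic optimum
xs = [i / 1000.0 for i in range(-8000, 8000)]
best = max(xs, key=lambda x: local_objective(x, n_wc, n_w, n_c, D, k))
```

Since ℓ(w, c) is strictly concave in x, the grid argmax lands within one grid step of the analytic solution.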
Using a similar derivation, it can be shown that noise-contrastive estimation (NCE) [24] is factorizing the (shifted) log-conditional-probability matrix:\n\nM^NCE_ij = w\u20d7_i \u00b7 c\u20d7_j = log( #(w, c) / #(c) ) - log k = log P(w|c) - log k   (8)\n\n3.2 Weighted Matrix Factorization\n\nWe obtained that SGNS's objective is optimized by setting w\u20d7 \u00b7 c\u20d7 = PMI(w, c) - log k for every (w, c) pair. However, this assumes that the dimensionality of w\u20d7 and c\u20d7 is high enough to allow for perfect reconstruction. When perfect reconstruction is not possible, some w\u20d7 \u00b7 c\u20d7 products must deviate from their optimal values. Looking at the pair-specific objective (equation 5) reveals that the loss for a pair (w, c) depends on its number of observations (#(w, c)) and expected negative samples (k \u00b7 #(w) \u00b7 #(c) / |D|). SGNS's objective can now be cast as a weighted matrix factorization problem, seeking the optimal d-dimensional factorization of the matrix M^PMI - log k under a metric which pays more for deviations on frequent (w, c) pairs than for deviations on infrequent ones.\n\n3.3 Pointwise Mutual Information\n\nPointwise mutual information is an information-theoretic association measure between a pair of discrete outcomes x and y, defined as:\n\nPMI(x, y) = log( P(x, y) / (P(x) P(y)) )   (9)\n\nIn our case, PMI(w, c) measures the association between a word w and a context c by calculating the log of the ratio between their joint probability (the frequency in which they occur together) and their marginal probabilities (the frequency in which they occur independently). 
PMI can be estimated empirically by considering the actual number of observations in a corpus:\n\nPMI(w, c) = log( (#(w, c) \u00b7 |D|) / (#(w) \u00b7 #(c)) )   (10)\n\nThe use of PMI as a measure of association in NLP was introduced by Church and Hanks [8] and widely adopted for word similarity tasks [11, 27, 29].\n\nWorking with the PMI matrix presents some computational challenges. The rows of M^PMI contain many entries of word-context pairs (w, c) that were never observed in the corpus, for which PMI(w, c) = log 0 = -\u221e. Not only is the matrix ill-defined, it is also dense, which is a major practical issue because of its huge dimensions |V_W| \u00d7 |V_C|. One could smooth the probabilities using, for instance, a Dirichlet prior by adding a small \u201cfake\u201d count to the underlying counts matrix, rendering all word-context pairs observed. While the resulting matrix will not contain any infinite values, it will remain dense.\n\nAn alternative approach, commonly used in NLP, is to replace the M^PMI matrix with M^PMI_0, in which PMI(w, c) = 0 in cases where #(w, c) = 0, resulting in a sparse matrix. We note that M^PMI_0 is inconsistent, in the sense that observed but \u201cbad\u201d (uncorrelated) word-context pairs have a negative matrix entry, while unobserved (hence worse) ones have 0 in their corresponding cell. Consider for example a pair of relatively frequent words (high P(w) and P(c)) that occur only once together. There is strong evidence that the words tend not to appear together, resulting in a negative PMI value, and hence a negative matrix entry. 
On the other hand, a pair of frequent words (same P(w) and P(c)) that is never observed occurring together in the corpus will receive a value of 0.\n\nA sparse and consistent alternative from the NLP literature is to use the positive PMI (PPMI) metric, in which all negative values are replaced by 0:\n\nPPMI(w, c) = max(PMI(w, c), 0)   (11)\n\nWhen representing words, there is some intuition behind ignoring negative values: humans can easily think of positive associations (e.g. \u201cCanada\u201d and \u201csnow\u201d) but find it much harder to invent negative ones (\u201cCanada\u201d and \u201cdesert\u201d). This suggests that the perceived similarity of two words is more influenced by the positive context they share than by the negative context they share. It therefore makes some intuitive sense to discard the negatively associated contexts and mark them as \u201cuninformative\u201d (0) instead.^2 Indeed, it was shown that the PPMI metric performs very well on semantic similarity tasks [5].\n\nBoth M^PMI_0 and M^PPMI are well known to the NLP community. In particular, systematic comparisons of various word-context association metrics show that PMI, and more so PPMI, provide the best results for a wide range of word-similarity tasks [5, 16]. It is thus interesting that the PMI matrix emerges as the optimal solution for SGNS's objective.\n\n4 Alternative Word Representations\n\nAs SGNS with k = 1 is attempting to implicitly factorize the familiar matrix M^PMI, a natural algorithm would be to use the rows of M^PPMI directly when calculating word similarities. Though PPMI is only an approximation of the original PMI matrix, it still brings the objective function very close to its optimum (see Section 5.1). 
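Equations 10 and 11 translate directly into a dense-matrix sketch (ours, assuming a toy count matrix small enough to hold in memory; real vocabularies would need sparse storage):

```python
import numpy as np

def ppmi_matrix(counts):
    """Positive PMI (eq. 11) from a |V_W| x |V_C| co-occurrence count matrix.

    Cells with #(w,c) = 0 are mapped to 0 rather than -inf, so the result
    stays as sparse as the input counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                        # |D|
    row = counts.sum(axis=1, keepdims=True)     # #(w)
    col = counts.sum(axis=0, keepdims=True)     # #(c)
    with np.errstate(divide='ignore'):
        pmi = np.log(counts * total / (row * col))   # eq. 10
    pmi[counts == 0] = 0.0                      # unobserved pairs -> 0, not -inf
    return np.maximum(pmi, 0.0)                 # clip negative associations
```

On the 2x2 toy counts [[2, 0], [1, 1]], the frequent pair keeps its positive PMI while the negative and unobserved cells both collapse to 0, illustrating the consistency the text argues for.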
In this section, we propose two alternative word representations that build upon M^PPMI.\n\n4.1 Shifted PPMI\n\nWhile the PMI matrix emerges from SGNS with k = 1, it was shown that different values of k can substantially improve the resulting embedding. With k > 1, the association metric in the implicitly factorized matrix is PMI(w, c) - log k. This suggests the use of Shifted PPMI (SPPMI), a novel association metric which, to the best of our knowledge, was not explored in the NLP and word-similarity communities:\n\nSPPMI_k(w, c) = max(PMI(w, c) - log k, 0)   (12)\n\nAs with SGNS, certain values of k can improve the performance of M^{SPPMI_k} on different tasks.\n\n4.2 Spectral Dimensionality Reduction: SVD over Shifted PPMI\n\nWhile sparse vector representations work well, there are also advantages to working with dense low-dimensional vectors, such as improved computational efficiency and, arguably, better generalization.\n\n^2 A notable exception is the case of syntactic similarity. For example, all verbs share a very strong negative association with being preceded by determiners, and past tense verbs have a very strong negative association with being preceded by \u201cbe\u201d verbs and modals.\n\nAn alternative matrix factorization method to SGNS's stochastic gradient training is truncated Singular Value Decomposition (SVD) - a basic algorithm from linear algebra which is used to achieve the optimal rank-d factorization with respect to L2 loss [12]. SVD factorizes M into the product of three matrices U \u00b7 \u03a3 \u00b7 V^T, where U and V are orthonormal and \u03a3 is a diagonal matrix of singular values. Let \u03a3_d be the diagonal matrix formed from the top d singular values, and let U_d and V_d be the matrices produced by selecting the corresponding columns from U and V. 
The matrix M_d = U_d \u00b7 \u03a3_d \u00b7 V_d^T is the matrix of rank d that best approximates the original matrix M, in the sense that it minimizes the approximation errors. That is, M_d = arg min_{Rank(M') = d} ||M' - M||_2.\n\nWhen using SVD, the dot-products between the rows of W = U_d \u00b7 \u03a3_d are equal to the dot-products between rows of M_d. In the context of word-context matrices, the dense, d-dimensional rows of W are perfect substitutes for the very high-dimensional rows of M_d. Indeed, another common approach in the NLP literature is factorizing the PPMI matrix M^PPMI with SVD, and then taking the rows of W^SVD = U_d \u00b7 \u03a3_d and C^SVD = V_d as word and context representations, respectively. However, using the rows of W^SVD as word representations consistently under-performs the W^SGNS embeddings derived from SGNS when evaluated on semantic tasks.\n\nSymmetric SVD. We note that in the SVD-based factorization, the resulting word and context matrices have very different properties. In particular, the context matrix C^SVD is orthonormal while the word matrix W^SVD is not. On the other hand, the factorization achieved by SGNS's training procedure is much more \u201csymmetric\u201d, in the sense that neither W^W2V nor C^W2V is orthonormal, and no particular bias is given to either of the matrices in the training objective. We therefore propose achieving similar symmetry with the following factorization:\n\nW^{SVD_1/2} = U_d \u00b7 \u221a\u03a3_d    C^{SVD_1/2} = V_d \u00b7 \u221a\u03a3_d   (13)\n\nWhile it is not theoretically clear why the symmetric approach is better for semantic tasks, it does work much better empirically.^3\n\nSVD versus SGNS. The spectral algorithm has two computational advantages over stochastic gradient training. First, it is exact, and does not require learning rates or hyper-parameter tuning. Second, it can be easily trained on count-aggregated data (i.e. 
{(w, c, #(w, c))} triplets), making it\napplicable to much larger corpora than SGNS\u2019s training procedure, which requires each observation\nof (w, c) to be presented separately.\nOn the other hand, the stochastic gradient method has advantages as well: in contrast to SVD, it\ndistinguishes between observed and unobserved events; SVD is known to suffer from unobserved\nvalues [17], which are very common in word-context matrices. More importantly, SGNS\u2019s objective\nweighs different (w, c) pairs differently, preferring to assign correct values to frequent (w, c) pairs\nwhile allowing more error for infrequent pairs (see Section 3.2). Unfortunately, exact weighted\nSVD is a hard computational problem [25]. Finally, because SGNS cares only about observed\n(and sampled) (w, c) pairs, it does not require the underlying matrix to be a sparse one, enabling\noptimization of dense matrices, such as the exact P M I \u2212 log k matrix. The same is not feasible\nwhen using SVD.\nAn interesting middle-ground between SGNS and SVD is the use of stochastic matrix factorization\n(SMF) approaches, common in the collaborative \ufb01ltering literature [17]. In contrast to SVD, the\nSMF approaches are not exact, and do require hyper-parameter tuning. On the other hand, they\nare better than SVD at handling unobserved values, and can integrate importance weighting for\nexamples, much like SGNS\u2019s training procedure. However, like SVD and unlike SGNS\u2019s procedure,\nthe SMF approaches work over aggregated (w, c) statistics allowing (w, c, f (w, c)) triplets as input,\nmaking the optimization objective more direct, and scalable to signi\ufb01cantly larger corpora. SMF\napproaches have additional advantages over both SGNS and SVD, such as regularization, opening\nthe way to a range of possible improvements. 
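Returning to the spectral method of Section 4.2, combining equations 12 and 13 gives the whole pipeline in a few lines. The sketch below is our illustration on a toy dense count matrix, not the authors' code; a production version would use sparse matrices and a truncated SVD solver:

```python
import numpy as np

def sppmi_svd_embeddings(counts, k=5, d=2):
    """SVD over the Shifted PPMI matrix (eqs. 12-13).

    Returns the 'symmetric' factors W = U_d * sqrt(S_d), C = V_d * sqrt(S_d),
    so that W @ C.T approximates the SPPMI matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide='ignore'):
        pmi = np.log(counts * total / (row * col))   # -inf where #(w,c) = 0
    sppmi = np.maximum(pmi - np.log(k), 0.0)         # eq. 12
    U, S, Vt = np.linalg.svd(sppmi, full_matrices=False)
    W = U[:, :d] * np.sqrt(S[:d])                    # eq. 13, word vectors
    C = Vt[:d].T * np.sqrt(S[:d])                    # eq. 13, context vectors
    return W, C
```

With d equal to the full rank, W @ C.T reconstructs the SPPMI matrix exactly, which makes the sketch easy to sanity-check.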
We leave the exploration of SMF-based algorithms for word embeddings to future work.\n\n^3 The approach can be generalized to W^{SVD_\u03b1} = U_d \u00b7 (\u03a3_d)^\u03b1, making \u03b1 a tunable parameter. This observation was previously made by Caron [7] and investigated in [6, 28], showing that different values of \u03b1 indeed perform better than others for various tasks. In particular, setting \u03b1 = 0 performs well for many tasks. We do not explore tuning the \u03b1 parameter in this work.\n\nTable 1: Percentage of deviation from the optimal objective value (lower values are better). See 5.1 for details.\n\nMethod | PMI - log k | SPPMI    | SVD                      | SGNS\n       |             |          | d=100   d=500   d=1000   | d=100   d=500   d=1000\nk = 1  | 0%          | 0.00009% | 26.1%   25.2%   24.2%    | 31.4%   29.4%   7.40%\nk = 5  | 0%          | 0.00004% | 95.8%   95.1%   94.9%    | 39.3%   36.0%   7.13%\nk = 15 | 0%          | 0.00002% | 266%    266%    265%     | 7.80%   6.37%   5.97%\n\n5 Empirical Results\n\nWe compare the matrix-based algorithms to SGNS in two aspects. First, we measure how well each algorithm optimizes the objective, and then proceed to evaluate the methods on various linguistic tasks. We find that for some tasks there is a large discrepancy between optimizing the objective and doing well on the linguistic task.\n\nExperimental Setup. All models were trained on English Wikipedia, pre-processed by removing non-textual elements, sentence splitting, and tokenization. The corpus contains 77.5 million sentences, spanning 1.5 billion tokens. All models were derived using a window of 2 tokens to each side of the focus word, ignoring words that appeared less than 100 times in the corpus, resulting in vocabularies of 189,533 terms for both words and contexts. 
To train the SGNS models, we used a modified version of word2vec which receives a sequence of pre-extracted word-context pairs [18].^4 We experimented with three values of k (number of negative samples in SGNS, shift parameter in PMI-based methods): 1, 5, 15. For SVD, we take W = U_d \u00b7 \u221a\u03a3_d as explained in Section 4.\n\n5.1 Optimizing the Objective\n\nNow that we have an analytical solution for the objective, we can measure how well each algorithm optimizes this objective in practice. To do so, we calculated \u2113, the value of the objective (equation 2) given each word (and context) representation.^5 For sparse matrix representations, we substituted w\u20d7 \u00b7 c\u20d7 with the matching cell's value (e.g. for SPPMI, we set w\u20d7 \u00b7 c\u20d7 = max(PMI(w, c) - log k, 0)). Each algorithm's \u2113 value was compared to \u2113_Opt, the objective when setting w\u20d7 \u00b7 c\u20d7 = PMI(w, c) - log k, which was shown to be optimal (Section 3.1). The percentage of deviation from the optimum is defined by (\u2113 - \u2113_Opt) / \u2113_Opt and presented in Table 1.\n\nWe observe that SPPMI is indeed a near-perfect approximation of the optimal solution, even though it discards a lot of information when considering only positive cells. We also note that for the factorization methods, increasing the dimensionality enables better solutions, as expected. SVD is slightly better than SGNS at optimizing the objective for d \u2264 500 and k = 1. However, while SGNS is able to leverage higher dimensions and reduce its error significantly, SVD fails to do so. Furthermore, SVD becomes very erroneous as k increases. 
We hypothesize that this is a result of the increasing number of zero-cells, which may cause SVD to prefer a factorization that is very close to the zero matrix, since SVD's L2 objective is unweighted, and does not distinguish between observed and unobserved matrix cells.\n\n^4 http://www.bitbucket.org/yoavgo/word2vecf\n\n^5 Since it is computationally expensive to calculate the exact objective, we approximated it. First, instead of enumerating every observed word-context pair in the corpus, we sampled 10 million such pairs, according to their prevalence. Second, instead of calculating the expectation term explicitly (as in equation 4), we sampled a negative example (w, c_N) for each one of the 10 million \u201cpositive\u201d examples, using the contexts' unigram distribution, as done by SGNS's optimization procedure (explained in Section 2).\n\nTable 2: A comparison of word representations on various linguistic tasks. The different representations were created by three algorithms (SPPMI, SVD, SGNS) with d = 1000 and different values of k.\n\nWS353 (WordSim) [13]     | MEN (WordSim) [4]       | Mixed Analogies [20]    | Synt. Analogies [22]\nRepresentation    Corr.  | Representation    Corr. | Representation    Acc.  | Representation    Acc.\nSVD   (k=5)       0.691  | SVD   (k=1)       0.735 | SPPMI (k=1)       0.655 | SGNS  (k=15)      0.627\nSPPMI (k=15)      0.687  | SVD   (k=5)       0.734 | SPPMI (k=5)       0.644 | SGNS  (k=5)       0.619\nSPPMI (k=5)       0.670  | SPPMI (k=5)       0.721 | SGNS  (k=15)      0.619 | SGNS  (k=1)       0.59\nSGNS  (k=15)      0.666  | SPPMI (k=15)      0.719 | SGNS  (k=5)       0.616 | SPPMI (k=5)       0.466\nSVD   (k=15)      0.661  | SGNS  (k=15)      0.716 | SPPMI (k=15)      0.571 | SVD   (k=1)       0.448\nSVD   (k=1)       0.652  | SGNS  (k=5)       0.708 | SVD   (k=1)       0.567 | SPPMI (k=1)       0.445\nSGNS  (k=5)       0.644  | SVD   (k=15)      0.694 | SGNS  (k=1)       0.540 | SPPMI (k=15)      0.353\nSGNS  (k=1)       0.633  | SGNS  (k=1)       0.690 | SVD   (k=5)       0.472 | SVD   (k=5)       0.337\nSPPMI (k=1)       0.605  | SPPMI (k=1)       0.688 | SVD   (k=15)      0.341 | SVD   (k=15)      0.208\n\n5.2 Performance of Word Representations on Linguistic Tasks\n\nLinguistic Tasks and Datasets. We evaluated the word representations on four datasets, covering word similarity and relational analogy tasks. We used two datasets to evaluate pairwise word similarity: Finkelstein et al.'s WordSim353 [13] and Bruni et al.'s MEN [4]. These datasets contain word pairs together with human-assigned similarity scores. The word vectors are evaluated by ranking the pairs according to their cosine similarities, and measuring the correlation (Spearman's \u03c1) with the human ratings.\n\nThe two analogy datasets present questions of the form \u201ca is to a* as b is to b*\u201d, where b* is hidden, and must be guessed from the entire vocabulary. The Syntactic dataset [22] contains 8000 morpho-syntactic analogy questions, such as \u201cgood is to best as smart is to smartest\u201d. The Mixed dataset [20] contains 19544 questions, about half of the same kind as in Syntactic, and another half of a more semantic nature, such as capital cities (\u201cParis is to France as Tokyo is to Japan\u201d). After filtering questions involving out-of-vocabulary words, i.e. words that appeared in English Wikipedia less than 100 times, we remain with 7118 instances in Syntactic and 19258 instances in Mixed. 
The analogy questions are answered using Levy and Goldberg's similarity multiplication method [19], which is state-of-the-art in analogy recovery: arg max_{b* ∈ V_W \ {a*, b, a}} cos(b*, a*) · cos(b*, b) / (cos(b*, a) + ε). The evaluation metric for the analogy questions is the percentage of questions for which the argmax result was the correct answer (b*).

Results  Table 2 shows the experiments' results. On the word similarity tasks, SPPMI yields better results than SGNS, and SVD improves even further. However, the difference between the top PMI-based method and the top SGNS configuration on each dataset is small, and it is reasonable to say that they perform on par. It is also evident that different values of k have a significant effect on all methods: SGNS generally works better with higher values of k, whereas SPPMI and SVD prefer lower values of k. This may be because only positive values are retained, so high values of k discard too much information. A similar observation was made for SGNS and SVD when observing how well they optimized the objective (Section 5.1). Nevertheless, tuning k can significantly increase the performance of SPPMI over the traditional PPMI configuration (k = 1).

The analogy tasks show different behavior. First, SVD does not perform as well as SGNS and SPPMI. More interestingly, on the syntactic analogies dataset, SGNS significantly outperforms the rest. This trend is even more pronounced when using the additive analogy recovery method [22] (not shown). Linguistically speaking, the syntactic analogies dataset is quite different from the rest, since solving it correctly relies more on contextual information from common words such as determiners ("the", "each", "many") and auxiliary verbs ("will", "had").
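The multiplicative scoring used for the analogy questions can be sketched with NumPy as follows. As in [19], cosine similarities are shifted from [-1, 1] to [0, 1] before multiplying so that the product is well-behaved; the vocabulary and vectors in the usage example are toy values, not trained embeddings:

```python
import numpy as np

def three_cos_mul(W, vocab, a, a_star, b, eps=0.001):
    """Return the b* maximizing cos(b*, a*) * cos(b*, b) / (cos(b*, a) + eps)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit rows: dot product = cosine
    idx = {w: i for i, w in enumerate(vocab)}
    # Columns: cos(., a), cos(., a*), cos(., b) for every candidate row.
    sims = Wn @ Wn[[idx[a], idx[a_star], idx[b]]].T
    sims = (sims + 1) / 2  # shift similarities into [0, 1]
    scores = sims[:, 1] * sims[:, 2] / (sims[:, 0] + eps)
    for w in (a, a_star, b):  # exclude the question words themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy vectors in which "king" adds a royalty dimension to "man",
# so "man is to king as woman is to ?" should recover "queen".
W = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [0.5, 0.5, 0]], float)
vocab = ["man", "king", "woman", "queen", "dog"]
answer = three_cos_mul(W, vocab, "man", "king", "woman")
```

The small eps only guards against division by zero; the multiplicative form rewards candidates that are similar to both a* and b while penalizing similarity to a.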
We conjecture that SGNS performs better on this task because its training procedure gives more influence to frequent pairs, as opposed to SVD's objective, which gives the same weight to all matrix cells (see Section 3.2).

6 Conclusion

We analyzed the SGNS word embedding algorithm, and showed that it implicitly factorizes the (shifted) word-context PMI matrix M^PMI − log k using per-observation stochastic gradient updates. We presented SPPMI, a modification of PPMI inspired by our theoretical findings. Indeed, using SPPMI can improve upon the traditional PPMI matrix. Though SPPMI provides a far better solution to SGNS's objective, it does not necessarily perform better than SGNS on linguistic tasks, as is evident with syntactic analogies. We suspect that this may be related to SGNS down-weighting rare words, whose significance PMI-based methods are known to exaggerate.

We also experimented with an alternative matrix factorization method, SVD. Although SVD was relatively poor at optimizing SGNS's objective, it performed slightly better than the other methods on the word similarity datasets. However, SVD underperforms on the word analogy tasks. One of the main differences between SVD and SGNS is that SGNS performs weighted matrix factorization, which may be giving it an edge on the analogy tasks. As future work we suggest investigating weighted matrix factorizations of word-context matrices with PMI-based association metrics.

Acknowledgements  This work was partially supported by the EC-funded project EXCITEMENT (FP7ICT-287923). We thank Ido Dagan and Peter Turney for their valuable insights.

References

[1] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL, 2014.

[2] Marco Baroni and Alessandro Lenci. Distributional memory: A general framework for corpus-based semantics.
Computational Linguistics, 36(4):673–721, 2010.

[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[4] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. Distributional semantics in technicolor. In ACL, 2012.

[5] John A. Bullinaria and Joseph P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526, 2007.

[6] John A. Bullinaria and Joseph P. Levy. Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44(3):890–907, 2012.

[7] John Caron. Experiments with LSA scoring: Optimal rank and basis. In Proceedings of the SIAM Computational Information Retrieval Workshop, pages 157–169, 2001.

[8] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

[9] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.

[10] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 2011.

[11] Ido Dagan, Fernando Pereira, and Lillian Lee. Similarity-based estimation of word cooccurrence probabilities. In ACL, 1994.

[12] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.

[13] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. ACM TOIS, 2002.

[14] Yoav Goldberg and Omer Levy.
word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[15] Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.

[16] Douwe Kiela and Stephen Clark. A systematic study of semantic vector space model parameters. In Workshop on Continuous Vector Space Models and their Compositionality, 2014.

[17] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 2009.

[18] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In ACL, 2014.

[19] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In CoNLL, 2014.

[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[22] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In NAACL, 2013.

[23] Andriy Mnih and Geoffrey E. Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pages 1081–1088, 2008.

[24] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, 2013.

[25] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML, 2003.

[26] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In ACL, 2010.

[27] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML, 2001.

[28] Peter D. Turney.
Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585, 2012.

[29] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 2010.