{"title": "Near-Optimal Smoothing of Structured Conditional Probability Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 4860, "page_last": 4868, "abstract": "Utilizing the structure of a probabilistic model can significantly increase its learning speed. Motivated by several recent applications, in particular bigram models in language processing, we consider learning low-rank conditional probability matrices under expected KL-risk. This choice makes smoothing, that is the careful handling of low-probability elements, paramount. We derive an iterative algorithm that extends classical non-negative matrix factorization to naturally incorporate additive smoothing and prove that it converges to the stationary points of a penalized empirical risk. We then derive sample-complexity bounds for the global minimizer of the penalized risk and show that it is within a small factor of the optimal sample complexity. This framework generalizes to more sophisticated smoothing techniques, including absolute-discounting.", "full_text": "Near-Optimal Smoothing of Structured Conditional\n\nProbability Matrices\n\nUniversity of California, San Diego\n\nToyota Technological Institute at Chicago\n\nMoein Falahatgar\n\nSan Diego, CA, USA\n\nmoein@ucsd.edu\n\nMesrob I. Ohannessian\n\nChicago, IL, USA\nmesrob@ttic.edu\n\nAlon Orlitsky\n\nUniversity of California, San Diego\n\nSan Diego, CA, USA\n\nalon@ucsd.edu\n\nAbstract\n\nUtilizing the structure of a probabilistic model can signi\ufb01cantly increase its learning\nspeed. Motivated by several recent applications, in particular bigram models\nin language processing, we consider learning low-rank conditional probability\nmatrices under expected KL-risk. This choice makes smoothing, that is the careful\nhandling of low-probability elements, paramount. 
We derive an iterative algorithm that extends classical non-negative matrix factorization to naturally incorporate additive smoothing and prove that it converges to the stationary points of a penalized empirical risk. We then derive sample-complexity bounds for the global minimizer of the penalized risk and show that it is within a small factor of the optimal sample complexity. This framework generalizes to more sophisticated smoothing techniques, including absolute-discounting.

1 Introduction
One of the fundamental tasks in statistical learning is probability estimation. When the possible outcomes can be divided into k discrete categories, e.g. types of words or bacterial species, the task of interest is to use data to estimate the probability masses p1, ..., pk, where pj is the probability of observing category j. More often than not, it is not a single distribution that is to be estimated, but multiple related distributions, e.g. frequencies of words within various contexts or species in different samples. We can group these into a conditional probability (row-stochastic) matrix Pi,1, ..., Pi,k as i varies over c contexts, and Pij represents the probability of observing category j in context i. Learning these distributions individually would cause the data to be unnecessarily diluted. Instead, the structure of the relationship between the contexts should be harnessed.
A number of models have been proposed to address this structured learning task. One of the wildly successful approaches consists of positing that P, despite being a c × k matrix, is in fact of much lower rank m. Effectively, this means that there exists a latent context space of size m ≪ c, k into which the original context maps probabilistically via a c × m stochastic matrix A, then this latent context in turn determines the outcome via an m × k stochastic matrix B. 
Since this structural model means\nthat P factorizes as P = AB, this problem falls within the framework of low-rank (non-negative)\nmatrix factorization. Many topic models, such as the original work on probabilistic latent semantic\nanalysis PLSA, also map to this framework. We narrow our attention here to such low-rank models,\nbut note that more generally these efforts fall under the areas of structured and transfer learning.\nOther examples include: manifold learning, multi-task learning, and hierarchical models.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fIn natural language modeling, low-rank models are motivated by the inherent semantics of language:\ncontext \ufb01rst maps into meaning which then maps to a new word prediction. An alternative form\nof such latent structure, word embeddings derived from recurrent neural networks (or LSTMs) are\nthe state-of-the-art of current language models, [20, 25, 28]. A \ufb01rst chief motivation for the present\nwork is to establish a theoretical underpinning of the success of such representations. We restrict the\nexposition to bigram models. The traditional de\ufb01nition of the bigram is that language is modeled as a\nsequence of words generated by a \ufb01rst order Markov-chain. Therefore the \u2018context\u2019 of a new word is\nsimply its preceding word, and we have c = k. Since the focus here is not the dependencies induced\nby such memory, but rather the rami\ufb01cations of the structural assumptions on P , we take bigrams to\nmodel word-pairs independently sampled by \ufb01rst choosing the contextual word with probability \u03c0\nand then choosing the second word according to the conditional probability P , thus resulting in a\njoint distribution over word-pairs (\u03c0iPij).\nWhat is the natural measure of performance for a probability matrix estimator? 
Since ultimately\nsuch estimators are used to accurately characterize the likelihood of test data, the measure of choice\nused in empirical studies is the perplexity, or alternatively its logarithm, the cross entropy. For data\nconsisting of n word-pairs, if Cij is the number of times pair (i, j) appears, then the cross entropy\n. The population quantity that corresponds to this empirical\nof an estimator Q is 1\nn\nij \u03c0iPij log Pij\n.\nQij\nNote that this is indeed the expectation of the cross entropy modulo the true entropy, an additive term\nthat does not depend on Q. This is the natural notion of risk for the learning task, since we wish\nto infer the likelihood of future data, and our goal can now be more concretely stated as using the\ndata to produce an estimator Qn with a \u2018small\u2019 value of D(P(cid:107)Qn). The choice of KL-divergence\nintroduces a peculiar but important problem: the necessity to handle small frequencies appropriately.\nIn particular, using the empirical conditional probability is not viable, since a zero in Q implies\nin\ufb01nite risk. This is the problem of smoothing, which has received a great amount of attention by the\nNLP community. 
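As a minimal illustration of the risk just defined (our own sketch, not the authors' code; `kl_risk` and its argument names are hypothetical), the row-by-row weighted KL-divergence D(P||Q) can be computed directly, and it shows concretely why a single zero in Q is fatal:

```python
import numpy as np

def kl_risk(pi, P, Q):
    """Row-weighted KL-divergence D(P||Q) = sum_i pi_i sum_j P_ij log(P_ij / Q_ij).

    pi: length-k context distribution; P, Q: k x k row-stochastic matrices.
    Entries with P_ij = 0 contribute 0; entries with Q_ij = 0 but P_ij > 0
    make the risk infinite.
    """
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log(P / Q), 0.0)
    return float(np.sum(pi[:, None] * terms))

pi = np.array([0.5, 0.5])
P = np.array([[0.5, 0.5], [0.25, 0.75]])
assert kl_risk(pi, P, P) == 0.0           # the truth itself has zero risk
Q = np.array([[1.0, 0.0], [0.25, 0.75]])  # one unsmoothed zero ...
assert np.isinf(kl_risk(pi, P, Q))        # ... makes the risk infinite
```

The infinite value for a single zero entry is exactly the phenomenon that makes smoothing unavoidable under this loss.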
Our second salient motivation for the present work is to propose principled methods\nof integrating well-established smoothing techniques, such as add- 1\n2 and absolute discounting, into\nthe framework of structured probability matrix estimation.\nOur contributions are as follows, we provide:\n\u2022 A general framework for integrating smoothing and structured probability matrix estimation, as an\n\nperformance measure is the (row-by-row weighted) KL-divergence D(P(cid:107)Q) =(cid:80)\n\nij Cij log 1\nQij\n\n(cid:80)\n\nalternating-minimization that converges to a stationary point of a penalized empirical risk.\n\nglobal minimizer of this penalized empirical risk.\n\n\u2022 A sample complexity upper bound of O(km log2(2n + k)/n) for the expected KL-risk, for the\n\u2022 A lower bound that matches this upper bound up to the logarithmic term, showing near-optimality.\nThe paper is organized as follows. Section 2 reviews related work. Section 3 states the problem\nand Section 4 highlights our main results. Section 5 proposes our central algorithm and Section 6\nanalyzes its idealized variant. Section 7 provides some experiments and Section 8 concludes.\n2 Related Work\nLatent variable models, and in particular non-negative matrix factorization and topic models, have\nbeen such an active area of research in the past two decades that the space here cannot possibly do\njustice to the many remarkable contributions. We list here some of the most relevant to place our\nwork in context. We start by mentioning the seminal papers [12, 18] which proposed the alternating\nminimization algorithm that forms the basis of the current work. This has appeared in many forms in\nthe literature, including the multiplicative updates [29]. 
Some of the earliest work is reviewed in [23].\nThese may be generally interpreted as discrete analogs to PCA (and even ICA) [10].\nAn in\ufb02uential Bayesian generative topic model, the Latent Dirichlet Allocation, [7] is very closely\nrelated to what we propose. In fact, add-half smoothing effectively corresponds to a Dirichlet(1/2)\n(Jeffreys) prior. Our exposition differs primarily in adopting a minimax sample complexity perspective\nwhich is often not found in the otherwise elegant Bayesian framework. Furthermore, exact Bayesian\ninference remains a challenge and a lot of effort has been expended lately toward simple iterative\nalgorithms with provable guarantees, e.g.\n[3, 4]. Besides, a rich array of ef\ufb01cient smoothing\ntechniques exists for probability vector estimation [2, 16, 22, 26], of which one could directly avail\nin the methodology that is presented here.\n\n2\n\n\fA direction that is very related to ours was recently proposed in [13]. There, the primary goal is to\nrecover the rows of A and B in (cid:96)1-risk. This is done at the expense of additional separation conditions\non these rows. This makes the performance measure not easily comparable to our context, though\nwith the proper weighted combination it is easy to see that the implied (cid:96)1-risk result on P is subsumed\nby our KL-risk result (via Pinsker\u2019s inequality), up to logarithmic factors, while the reverse isn\u2019t true.\nFurthermore, the framework of [13] is restricted to symmetric joint probability matrices, and uses\nan SVD-based algorithm that is dif\ufb01cult to scale beyond very small latent ranks m. Apart from this\nrecent paper for the (cid:96)1-risk, sample complexity bounds for related (not fully latent) models have been\nproposed for the KL-risk, e.g. [1]. But these remain partial, and far from optimal. 
It is also worth\nnoting that information geometry gives conditions under which KL-risk behaves close to (cid:96)2-risk [8],\nthus leading to a Frobenius-type risk in the matrix case.\nAlthough the core optimization problem itself is not our focus, we note that despite being a non-\nconvex problem, many instances of matrix factorization admit ef\ufb01cient solutions. Our own heuristic\ninitialization method is evidence of this. Recent work, in the (cid:96)2 context, shows that even simple\ngradient descent, appropriately initialized, could often provably converge to the global optimum [6].\nConcerning whether such low-rank models are appropriate for language modeling, there has been\nevidence that some of the abovementioned word embeddings [20] can be interpreted as implicit matrix\nfactorization [19]. Some of the traditional bigram smoothing techniques, such as the Kneser-Ney\nalgorithm [17, 11], are also reminiscent of rank reduction [14, 24, 15].\n\n3 Problem Statement\nData Dn consists of n pairs (Xs, Ys), s = 1,\u00b7\u00b7\u00b7 , n, where Xs is a context and Ys is the corresponding\noutcome. In the spirit of a bigram language model, we assume that the context and outcome spaces\nhave the same cardinality, namely k. Thus (Xs, Ys) takes values in [k]2. We denote the count of pairs\n\n(i, j) by Cij. As a shortcut, we also write the row-sums as Ci =(cid:80)\n\nj Cij.\n\nWe assume the underlying generative model of the data to be i.i.d., where each pair is drawn by \ufb01rst\nsampling the context Xs according to a probability distribution \u03c0 = (\u03c0i) over [k] and then sampling\nYs conditionally on Xs according to a k \u00d7 k conditional probability (stochastic) matrix P = (Pij), a\nnon-negative matrix where each row sums to 1. We also assume that P has non-negative rank m. We\ndenote the set of all such matrices by Pm. 
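To make the generative model of the data concrete, here is a small sketch (our own illustrative code; `sample_pairs` and the toy matrices are hypothetical) that draws n pairs from a rank-m model P = AB by routing each context through the latent space:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(pi, A, B, n):
    """Draw n (context, outcome) pairs: X ~ pi, latent Z ~ A[X], Y ~ B[Z].

    Marginalizing over Z, Y is distributed according to row X of P = A @ B.
    """
    k, m = A.shape
    X = rng.choice(k, size=n, p=pi)
    Z = np.array([rng.choice(m, p=A[x]) for x in X])
    Y = np.array([rng.choice(k, p=B[z]) for z in Z])
    return X, Y

# Tiny example: k = 3 contexts/outcomes, latent rank m = 2.
pi = np.array([0.2, 0.3, 0.5])
A = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])  # k x m, rows sum to 1
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])    # m x k, rows sum to 1
X, Y = sample_pairs(pi, A, B, n=1000)
C = np.zeros((3, 3))                                # pair-count matrix C_ij
np.add.at(C, (X, Y), 1)
assert C.sum() == 1000
```

The count matrix C built this way is exactly the data summary the estimators below consume.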
They can all be factorized (non-uniquely) as P = AB,\nwhere both A and B are stochastic matrices in turn, of size k \u00d7 m and m \u00d7 k respectively.\nA conditional probability matrix estimator is an algorithm that maps the data into a stochastic matrix\nQn(X1,\u00b7\u00b7\u00b7 , Xn) that well-approximates P , in the absence of any knowledge about the underlying\nmodel. We generally drop the explicit notation showing dependence on the data, and use instead\nthe implicit n-subscript notation. The performance, or how well any given stochastic matrix Q\napproximates P , is measured according to the KL-risk:\n\nR(Q) =\n\n\u03c0iPij log\n\nPij\nQij\n\n(1)\n\n(cid:88)\n\nij\n\nNote that this corresponds to an expected loss, with the log-loss L(Q, i, j) = log Pij/Qij. Although\nwe do seek out PAC-style (in-probability) bounds for R(Qn), in order to give a concise de\ufb01nition\nof optimality, we consider the average-case performance E[R(Qn)]. The expectation here is with\nrespect to the data. Since the underlying model is completely unknown, we would like to do well\nagainst adversarial choices of \u03c0 and P , and thus we are interested in a uniform upper bound of the\nform:\n\nr(Qn) = max\n\u03c0,P\u2208Pm\n\nE[R(Qn)].\n\nThe optimal estimator, in the minimax sense, and the minimax risk of the class Pm are thus given by:\n\nQ(cid:63)\n\nn = arg min\n\nQn\n\nr(Qn) = arg min\nr(cid:63)(Pm) = min\n\nQn\n\nmax\n\u03c0,P\u2208Pm\n\nQn\n\nmax\n\u03c0,P\u2208Pm\n\nE[R(Qn)]\n\nE[R(Qn)].\n\nExplicitly obtaining minimax optimal estimators is a daunting task, and instead we would like to\nexhibit estimators that compare well.\n\n3\n\n\fDe\ufb01nition 1 (Optimality). 
If an estimator satisfies E[R(Qn)] ≤ ϕ · E[R(Q⋆_n)], ∀π, (called an oracle inequality), then if ϕ is a constant (of n, k, and m), we say that the estimator is (order) optimal. If ϕ is not constant, but its growth is negligible with respect to the decay of r⋆(Pm) with n or the growth of r⋆(Pm) with k or m, then we can call the estimator near-optimal. In particular, we reserve this terminology for a logarithmic gap in growth, that is an estimator is near-optimal if log ϕ / log r⋆(Pm) → 0 asymptotically in any of n, k, or m. Finally, if ϕ does not depend on P we have strong optimality, and r(Qn) ≤ ϕ · r⋆(Pm). If ϕ does depend on P, we have weak optimality.
As a proxy to the true risk (1), we define the empirical risk:

Rn(Q) = (1/n) Σ_ij Cij log(Pij / Qij)    (2)

The conditional probability matrix that minimizes this empirical risk is the empirical conditional probability P̂n,ij = Cij / Ci. Not only is P̂n,ij not optimal, but since there always is a positive (even if slim) probability that some Cij = 0 even if Pij ≠ 0, it follows that E[R(P̂n)] = ∞. This shows the importance of smoothing. The simplest benchmark smoothing that we consider is add-1/2 smoothing, P̂^Add-1/2_n,ij = (Cij + 1/2) / (Ci + k/2), where we give an additional "phantom" half-sample to each word-pair, to avoid zeros. This simple method has optimal minimax performance when estimating probability vectors. However, in the present matrix case it is possible to show that this can be a factor of k/m away from optimal, which is significant (cf. Figure 1(a) in Section 7). Of course, since we have not used the low-rank structure of P, we may be tempted to "smooth by factoring", by performing a low-rank approximation of P̂n. 
However, this will not eliminate the zero problem, since a whole column may be zero. These facts highlight the importance of principled smoothing. The problem is therefore to construct (possibly weakly) optimal or near-optimal smoothed estimators.

4 Main Results
In Section 5 we introduce the ADD-1/2-SMOOTHED LOW-RANK algorithm, which essentially consists of EM-style alternating minimizations, with the addition of smoothing at each stage. Here we state the main results. The first is a characterization of the implicit risk function that the algorithm targets.
Theorem 2 (Algorithm). QAdd-1/2-LR converges to a stationary point of the penalized empirical risk

Rn,penalized(W, H) = Rn(Q) + (1/2n) Σ_{i,ℓ} log(1/W_{iℓ}) + (1/2n) Σ_{ℓ,j} log(1/H_{ℓj}), where Q = WH.    (3)

Conversely, any stationary point of (3) is a stable point of ADD-1/2-SMOOTHED LOW-RANK.
The proof of Theorem 2 follows closely that of [18]. We now consider the global minimum of this implicit risk, and give a sample complexity bound. By doing so, we intentionally decouple the algorithmic and statistical aspects of the problem and focus on the latter.
Theorem 3 (Sample Complexity). Let Qn ∈ Pm achieve the global minimum of Equation 3. Then for all P ∈ Pm such that Pij > (km/n) log(2n + k) ∀ i, j and n > 3,

E[R(Qn)] ≤ c (km/n) log²(2n + k), with c = 3100.

We outline the proof in Section 6. The basic ingredients are: showing the problem is near-realizable, a quantization argument to describe the complexity of Pm, and a PAC-style [27] relative uniform convergence which uses a sub-Poisson concentration for the sums of log likelihood ratios and uniform variance and scale bounds. Finer analysis based on VC theory may be possible, but it would need to handle the challenge of the log-loss being possibly unbounded and negative. 
The following result shows that Theorem 3 gives weak near-optimality for n large, as it is tight up to the logarithmic factor.
Theorem 4 (Lower Bound). For n > k, the minimax rate of Pm satisfies:

r⋆(Pm) ≥ c km/n, with c = 0.06.

This is based on the vector case lower bound and providing the oracle with additional information: instead of only (Xs, Ys) it observes (Xs, Zs, Ys), where Zs is sampled from Xs using A and Ys is sampled from Zs using B. This effectively allows the oracle to estimate A and B directly.

5 Algorithm
Our main algorithm is a direct modification of the classical alternating minimization algorithm for non-negative matrix factorization [12, 18]. This classical algorithm (with a slight variation) can be shown to essentially solve the following mathematical program:

QNNMF(Φ) = arg min_{Q=WH} Σ_i Σ_j Φij log(1/Qij).

The analysis is a simple extension of the original analysis of [12, 18]. By "essentially solves", we mean that each of the update steps can be identified as a coordinate descent, reducing the cost function and ultimately converging as T → ∞ to a stationary (zero gradient) point of this function. Conversely, all stationary points of the function are stable points of the algorithm. In particular, since the problem is convex in W and H individually, but not jointly in both, the algorithm can be thought of as taking exact steps toward minimizing over W (as H is held fixed) and then minimizing over H (as W is held fixed), whence the alternating-minimization name.
Before we incorporate smoothing, note that there are two ingredients missing from this algorithm. First, the cost function is the sum of row-by-row KL-divergences, but each row is not weighted, as compared to Equation (1). If we think of Φij as P̂ij = Cij/Ci, then the natural weight of row i is πi or its proxy Ci/n. 
For this, the algorithm can easily be patched. Similarly to the analysis of the original algorithm, one finds that this change essentially minimizes the weighted KL-risks of the empirical conditional probability matrix, or equivalently the empirical risk as defined in Equation (2):

QLR(C) = arg min_{Q=WH} Rn(Q) = arg min_{Q=WH} Σ_i (Ci/n) Σ_j (Cij/Ci) log(1/Qij).

Of course, this is nothing but the maximum likelihood estimator of P under the low-rank constraint. Just like the empirical conditional probability matrix, it suffers from lack of smoothing. For instance, if a whole column of C is zero, then so will be the corresponding column of QLR(C). The first naive attempt at smoothing would be to add 1/2 to C and then apply the algorithm:

QNaive Add-1/2-LR(C) = QLR(C + 1/2).

However, this would result in excessive smoothing, especially when m is small. The intuitive reason is this: in the extreme case of m = 1 all rows need to be combined, and thus instead of adding 1/2 to each category, QNaive Add-1/2-LR would add k/2, leading to the uniform distribution overwhelming the original distribution. We may be tempted to mitigate this by adding instead 1/(2k), but this doesn't generalize well to other smoothing methods. A more principled approach should perform smoothing directly inside the factorization, and this is exactly what we propose here. 
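The weighted alternating-minimization updates just described can be sketched in a few lines of NumPy (an illustrative reimplementation under our own hypothetical names, not the authors' code; adding 1/2 to W and H and renormalizing after each sweep would give the smoothed variant introduced next):

```python
import numpy as np

def low_rank_mle(C, m, iters=200, seed=0):
    """Alternating multiplicative updates for Q = W @ H minimizing
    sum_ij C_ij log(1 / Q_ij), with W (k x m) and H (m x k) kept row-stochastic."""
    rng = np.random.default_rng(seed)
    k = C.shape[0]
    W = rng.random((k, m)); W /= W.sum(axis=1, keepdims=True)
    H = rng.random((m, k)); H /= H.sum(axis=1, keepdims=True)
    for _ in range(iters):
        R = C / np.maximum(W @ H, 1e-12)   # ratio C_ij / (WH)_ij
        W = W * (R @ H.T)                  # W_il <- W_il * sum_j R_ij H_lj
        W /= np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
        R = C / np.maximum(W @ H, 1e-12)
        H = H * (W.T @ R)                  # H_lj <- H_lj * sum_i W_il R_ij
        H /= np.maximum(H.sum(axis=1, keepdims=True), 1e-12)
    return W, H

C = np.array([[8., 2., 0.], [4., 1., 0.], [0., 0., 5.]])  # toy count matrix
W, H = low_rank_mle(C, m=2)
Q = W @ H
assert np.allclose(Q.sum(axis=1), 1.0)  # each row of Q is a distribution
```

Because W and H are renormalized row by row, Q = WH is row-stochastic at every iteration; the zero column of C, however, still produces a zero column in Q, which is the smoothing gap the next algorithm closes.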
Our main algorithm is:
Algorithm: ADD-1/2-SMOOTHED LOW-RANK

• Input: k × k matrix (Cij); Initial W^0 and H^0; Number of iterations T
• Iterations: Start at t = 0, increment and repeat while t < T
  – For all i ∈ [k], ℓ ∈ [m], update W^t_{iℓ} ← W^{t−1}_{iℓ} Σ_j Cij H^{t−1}_{ℓj} / (WH)^{t−1}_{ij}
  – For all ℓ ∈ [m], j ∈ [k], update H^t_{ℓj} ← H^{t−1}_{ℓj} Σ_i Cij W^{t−1}_{iℓ} / (WH)^{t−1}_{ij}
  – Add 1/2 to each element of W^t and H^t, then normalize each row.
• Output: QAdd-1/2-LR(C) = W^T H^T

The intuition here is that, prior to normalization, the updated W and H can be interpreted as soft counts. One way to see this is to sum each row i of (pre-normalized) W, which would give Ci. As for H, the sums of its (pre-normalized) columns reproduce the sums of the columns of C. Next, we are naturally led to ask: is QAdd-1/2-LR(C) implicitly minimizing a risk, just as QLR(C) minimizes Rn(Q)? Theorem 2 shows that indeed QAdd-1/2-LR(C) essentially minimizes a penalized empirical risk.
More interestingly, ADD-1/2-SMOOTHED LOW-RANK lends itself to a host of generalizations. In particular, an important smoothing technique, absolute discounting, is very well suited for heavy-tailed data such as natural language [11, 21, 5]. We can generalize it to fractional counts as follows. Let Ci indicate counts in traditional (vector) probability estimation, and let D be the total number of distinct observed categories, i.e. D = Σ_i I{Ci ≥ 1}. Let the number of fractional distinct categories d be defined as d = Σ_i Ci I{Ci < 1}. We have the following soft absolute discounting smoothing:

P̂^Soft-AD_i(C, α) = (Ci − α) / Σ C, if Ci ≥ 1,
P̂^Soft-AD_i(C, α) = (1 − α) Ci / Σ C + α(D + d)(1 − Ci) / ((k − D − d) Σ C), if Ci < 1.

This gives us the following patched algorithm, which we do not place under the lens of theory currently, but we strongly support it with our experimental results of Section 7.
Algorithm: ABSOLUTE-DISCOUNTING-SMOOTHED LOW-RANK

• Input: Specify α ∈ (0, 1)
• Iteration:
  – Add 1/2 to each element of W^t, then normalize.
  – Apply soft absolute discounting to H^t: H^t_{ℓj} ← P̂^Soft-AD_j(H^t_{ℓ,·}, α)
• Output: QAD-LR(C, α) = W^T H^T

6 Analysis
We now outline the proof of the sample complexity upper bound of Theorem 3. Thus for the remainder of this section we have:

Qn(C) = arg min_{Q=WH} Rn(Q) + (1/2n) Σ_{i,ℓ} log(1/W_{iℓ}) + (1/2n) Σ_{ℓ,j} log(1/H_{ℓj}),

that is Qn ∈ Pm achieves the global minimum of Equation 3. Since we have a penalized empirical risk minimization at hand, we can study it within the classical PAC-learning framework. However, rates of order 1/n are often associated with the realizable case, where Rn(Qn) is exactly zero [27]. The following Lemma shows that we are near the realizable regime.
Lemma 5 (Near-realizability). We have

E[Rn(Qn)] ≤ k/n + (km/n) log(2n + k).

We characterize the complexity of the class Pm by quantizing probabilities, as follows. Given a positive integer L, define ∆L to be the subset of the appropriate simplex ∆ consisting of L-empirical distributions (or "types" in information theory): ∆L consists exactly of those distributions p that can be written as pi = Li/L, where Li are non-negative integers that sum to L.
Definition 6 (Quantization). 
Given a positive integer L, define the L-quantization operation as mapping a probability vector p to the closest (in ℓ1-distance) element of ∆L, p̃ = arg min_{q∈∆L} ‖p − q‖1. For a matrix P ∈ Pm, define an L-quantization for any given factorization choice P = AB as P̃ = ÃB̃, where each row of Ã and B̃ is the L-quantization of the respective row of A and B. Lastly, define Pm,L to be the set of all quantized probability matrices derived from Pm.
Via counting arguments, the cardinality of Pm,L is bounded by |Pm,L| ≤ (L + 1)^{2km}. This quantized family gives us the following approximation ability.
Lemma 7 (De-quantization). For a probability vector p, L-quantization satisfies |pi − p̃i| ≤ 1/L for all i, and ‖p − p̃‖1 ≤ 2/L. For a conditional probability matrix Q ∈ Pm, any quantization Q̃ satisfies |Qij − Q̃ij| ≤ 3/L for all i, j. Furthermore, if Q > ε per entry and L > 6/ε, then:

|R(Q) − R(Q̃)| ≤ 6/(Lε) and |Rn(Q) − Rn(Q̃)| ≤ 6/(Lε).

We now give a PAC-style relative uniform convergence bound on the empirical risk [27].
Theorem 8 (Relative uniform convergence). 
Assume lower-bounded P > δ and choose any τ > 0. We then have the following uniform bound over all lower-bounded Q̃ > ε in Pm,L (Definition 6):

Pr{ sup_{Q̃∈Pm,L, Q̃>ε} [R(Q̃) − Rn(Q̃)] / √R(Q̃) > τ } ≤ exp( −nτ² / (20 log(1/ε) + 2τ √(10 log(1/δ) log(1/ε))) + 2km log(L + 1) ).    (4)

The proof of this Theorem consists, for fixed Q̃, of showing a sub-Poisson concentration of the sum of the log likelihood ratios. This needs care, as a simple Bennett or Bernstein inequality is not enough, because we need to eventually self-normalize. A critical component is to relate the variance and scale of the concentration to the KL-risk and its square root, respectively. The theorem then follows from uniformly bounding the normalized variance and scale over Pm,L and a union bound.
To put the pieces together, first note that thanks to the fact that the optimum is also a stable point of the ADD-1/2-SMOOTHED LOW-RANK algorithm, the add-1/2 nature of the updates implies that all of the elements of Qn are lower-bounded by 1/(2n + k). By Lemma 7 and a proper choice of L of the order of (2n + k)², the quantized version won't be much smaller. We can thus choose ε = 1/(2n + k) in Theorem 8 and use our assumption of δ = (km/n) log(2n + k). Using Lemmas 5 and 7 to bound the contribution of the empirical risk, we can then integrate the probability bound of (4) similarly to the realizable case. This gives a bound on the expected risk of the quantized version of Qn of order (km/n) log(1/ε) log L or effectively (km/n) log²(2n + k). We then complete the proof by de-quantizing using Lemma 7.

7 Experiments
Having expounded the theoretical merit of properly smoothing structured conditional probability matrices, we give a brief empirical study of its practical impact. We use both synthetic and real data. The various methods compared are as follows:
• Add-1/2, directly on the bigram counts: P̂^Add-1/2_n,ij = (Cij + 1/2)/(Ci + k/2)
• Absolute-discounting, directly on the bigram counts: P̂^AD_n(C, α) (see Section 5)
• Naive Add-1/2 Low-Rank, smoothing the counts then factorizing: QNaive Add-1/2-LR = QLR(C + 1/2)
• Naive Absolute-Discounting Low-Rank: QNaive AD-LR = QLR(n P̂^AD_n(C, α))
• Stupid backoff (SB) of Google, a very simple algorithm proposed in [9]
• Kneser-Ney (KN), a widely successful algorithm proposed in [17]
• Add-1/2-Smoothed Low-Rank, our proposed algorithm with provable guarantees: QAdd-1/2-LR
• Absolute-Discounting-Smoothed Low-Rank, heuristic generalization of our algorithm: QAD-LR
The synthetic model is determined randomly. π is uniformly sampled from the k-simplex. The matrix P = AB is generated as follows. The rows of A are uniformly sampled from the k-simplex. The rows of B are generated in one of two ways: either sampled uniformly from the simplex or randomly permuted power law distributions, to imitate natural language. The discount parameter is then fixed to 0.75. Figure 1(a) uses uniformly sampled rows of B, and shows that, despite attempting to harness the low-rank structure of P, not only does Naive Add-1/2 fall short, but it may even perform worse than Add-1/2, which is oblivious to structure. Add-1/2-Smoothed Low-Rank, on the other hand, reaps the benefits of both smoothing and structure.
Figure 1(b) expands this setting to compare against other methods. Both of the proposed algorithms have an edge on all other methods. Note that Kneser-Ney is not expected to perform well in this regime (rows of B uniformly sampled), because uniformly sampled rows of B do not behave like natural language. On the other hand, for power law rows, even if k ≫ n, Kneser-Ney does well, and it is only superseded by Absolute-Discounting-Smoothed Low-Rank. The consistent good performance of Absolute-Discounting-Smoothed Low-Rank may be explained by the fact that absolute-discounting seems to enjoy some of the competitive-optimality of Good-Turing estimation, as recently demonstrated by [22]. This is why we chose to illustrate the flexibility of our framework by heuristically using absolute-discounting as the smoothing component.
Before moving on to experiments on real data, we give a short description of the data sets. All but the first one are readily available through the Python NLTK:
• tartuffe, a French text, train and test size: 9.3k words, vocabulary size: 2.8k words
• genesis, English version, train and test size: 19k words, vocabulary size: 4.4k words
• brown, shortened Brown corpus, train and test size: 20k words, vocabulary size: 10.5k words
For natural language, using absolute-discounting is imperative, and we restrict ourselves to Absolute-Discounting-Smoothed Low-Rank. The results of the performance of various algorithms are listed in Table 1. For all these experiments, m = 50 and 200 iterations were performed. Note that the proposed method has less cross-entropy per word across the board.
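For concreteness, the soft absolute-discounting smoother of Section 5 can be sketched as follows (our own illustrative code, with hypothetical names; `counts` may be fractional, as when the smoother is applied to the soft counts inside the factorization). The two cases reproduce the P̂^Soft-AD formula, and the result always sums to 1:

```python
import numpy as np

def soft_absolute_discounting(counts, alpha):
    """Soft absolute discounting for a (possibly fractional) count vector.

    Entries with count >= 1 are discounted by alpha; the freed mass is
    spread over the low-count entries in proportion to (1 - count).
    """
    counts = np.asarray(counts, dtype=float)
    k, total = counts.size, counts.sum()
    big = counts >= 1
    D = big.sum()                  # distinct observed categories
    d = counts[~big].sum()         # fractional distinct categories
    p = np.empty(k)
    p[big] = (counts[big] - alpha) / total
    p[~big] = ((1 - alpha) * counts[~big]
               + alpha * (D + d) * (1 - counts[~big]) / (k - D - d)) / total
    return p

p = soft_absolute_discounting([5, 3, 0.4, 0, 0], alpha=0.75)
assert np.isclose(p.sum(), 1.0)    # a proper distribution
assert (p > 0).all()               # no zeros survive smoothing
```

With integer counts and no fractional entries this reduces to classical absolute discounting; the fractional branch is what lets it act on the soft counts produced by the alternating updates.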
Figure 1: Performance of selected algorithms over synthetic data. (a) k = 100, m = 5; (b) k = 50, m = 3; (c) k = 1000, m = 10.

Figure 2: Experiments on real data. (a) Performance on tartuffe; (b) Performance on genesis; (c) rank selection for tartuffe.

Table 1: Cross-entropy results for different methods on several small corpora
Dataset  | Add-1/2 | AD     | SB     | KN     | AD-LR
tartuffe | 7.1808  | 6.268  | 6.0426 | 5.7555 | 5.6923
genesis  | 7.3039  | 6.041  | 5.9058 | 5.7341 | 5.6673
brown    | 8.847   | 7.9819 | 7.973  | 7.7001 | 7.609

We also illustrate the performance of different algorithms as the training size increases. Figure 2 shows the relative performance of selected algorithms with Stupid Backoff chosen as the baseline. As Figure 2(a) suggests, the amount of improvement in cross-entropy at n = 15k is around 0.1 nats/word. This improvement is comparable to, and even more significant than, that reported in the celebrated work of Chen and Goodman [11] for Kneser-Ney over the best algorithms at the time.
Even though our algorithm is given the rank m as a parameter, the internal dimension is not revealed, if ever known. Therefore, we could choose the best m using model selection. Figure 2(c) shows one way of doing this, by using a simple cross-validation for the tartuffe data set. In particular, half of the data was held out as a validation set, and for a range of different choices for m, the model was trained and its cross-entropy on the validation set was calculated. The figure shows that there exists a good choice of m ≪ k. A similar behavior is observed for all data sets. Most interestingly, the ratio of the best m to the vocabulary size of the corpus is reminiscent of the choice of internal dimension in [20].

8 Conclusion
Despite the theoretical impetus of the paper, the resulting algorithms considerably improve over several benchmarks. There is more work ahead, however. 
Many possible theoretical refinements are in order, such as eliminating the logarithmic term in the sample complexity and the dependence on P (strong optimality). This framework naturally extends to tensors, such as for higher-order N-gram language models. It is also worth bringing back the Markov assumption and understanding how various mixing conditions influence the sample complexity. A more challenging extension, and one we suspect may be necessary to be truly competitive with RNNs/LSTMs, is to parallel this contribution in the context of generative models with long memory. The reason we hope not only to be competitive with, but in fact to surpass, these models is that they do not use distributional properties of language, such as its quintessentially power-law nature. We expect smoothing methods such as absolute-discounting, which do account for this, to lead to considerable improvement.
Acknowledgments We would like to thank Venkatadheeraj Pichapati and Ananda Theertha Suresh for many helpful discussions.
This work was supported in part by NSF grants 1065622 and 1564355.

References
[1] Abe, Warmuth, and Takeuchi. Polynomial learnability of probabilistic concepts with respect to the Kullback-Leibler divergence. In COLT, 1991.
[2] Acharya, Jafarpour, Orlitsky, and Suresh. Optimal probability estimation with applications to prediction and classification. In COLT, 2013.
[3] Agarwal, Anandkumar, Jain, and Netrapalli. Learning sparsely used overcomplete dictionaries via alternating minimization. arXiv preprint arXiv:1310.7991, 2013.
[4] Arora, Ge, Ma, and Moitra. Simple, efficient, and neural algorithms for sparse coding. arXiv preprint arXiv:1503.00778, 2015.
[5] Ben Hamou, Boucheron, and Ohannessian. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, 2017.
[6] Bhojanapalli, Kyrillidis, and Sanghavi. Dropping convexity for faster semi-definite optimization. arXiv preprint arXiv:1509.03917, 2015.
[7] Blei, Ng, and Jordan. Latent Dirichlet allocation.
JMLR, 2003.
[8] Borade and Zheng. Euclidean information theory. IEEE Int. Zurich Seminar on Comm., 2008.
[9] Brants, Popat, Xu, Och, and Dean. Large language models in machine translation. In EMNLP, 2007.
[10] Buntine and Jakulin. Applying discrete PCA in data analysis. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 59–66. AUAI Press, 2004.
[11] Chen and Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393, 1999.
[12] Hofmann. Probabilistic latent semantic indexing. In ACM SIGIR, 1999.
[13] Huang, Kakade, Kong, and Valiant. Recovering structured probability matrices. arXiv preprint arXiv:1602.06586, 2016.
[14] Hutchinson, Ostendorf, and Fazel. Low rank language models for small training sets. IEEE SPL, 2011.
[15] Hutchinson, Ostendorf, and Fazel. A sparse plus low-rank exponential language model for limited resource scenarios. IEEE Trans. on Audio, Speech, and Language Processing, 2015.
[16] Kamath, Orlitsky, Pichapati, and Suresh. On learning distributions from their samples. In COLT, 2015.
[17] Kneser and Ney. Improved backing-off for m-gram language modeling. In ICASSP, 1995.
[18] Lee and Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.
[19] Levy and Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, 2014.
[20] Mikolov, Kombrink, Burget, Černocký, and Khudanpur. Extensions of recurrent neural network language model. In ICASSP, 2011.
[21] Ohannessian and Dahleh. Rare probability estimation under regularly varying heavy tails. In COLT, 2012.
[22] Orlitsky and Suresh. Competitive distribution estimation: Why is Good-Turing good. In NIPS, 2015.
[23] Papadimitriou, Tamaki, Raghavan, and Vempala. Latent semantic indexing: A probabilistic analysis. In ACM SIGACT-SIGMOD-SIGART, 1998.
[24] Parikh, Saluja, Dyer, and Xing.
Language modeling with power low rank ensembles. arXiv preprint arXiv:1312.7077, 2013.
[25] Shazeer, Pelemans, and Chelba. Skip-gram language modeling using sparse non-negative matrix probability estimation. arXiv preprint arXiv:1412.1454, 2014.
[26] Valiant and Valiant. Instance optimal learning. arXiv preprint arXiv:1504.05321, 2015.
[27] Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[28] Williams, Prasad, Mrva, Ash, and Robinson. Scaling recurrent neural network language models. arXiv preprint arXiv:1502.00512, 2015.
[29] Zhu, Yang, and Oja. Multiplicative updates for learning with stochastic matrices. In Im. An., 2013.