{"title": "Smoothed Gradients for Stochastic Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2438, "page_last": 2446, "abstract": "Stochastic variational inference (SVI) lets us scale up Bayesian computation to massive data. It uses stochastic optimization to fit a variational distribution, following easy-to-compute noisy natural gradients. As with most traditional stochastic optimization methods, SVI takes precautions to use unbiased stochastic gradients whose expectations are equal to the true gradients. In this paper, we explore the idea of following biased stochastic gradients in SVI. Our method replaces the natural gradient with a similarly constructed vector that uses a fixed-window moving average of some of its previous terms. We will demonstrate the many advantages of this technique. First, its computational cost is the same as for SVI and storage requirements only multiply by a constant factor. Second, it enjoys significant variance reduction over the unbiased estimates, smaller bias than averaged gradients, and leads to smaller mean-squared error against the full gradient. We test our method on latent Dirichlet allocation with three large corpora.", "full_text": "Smoothed Gradients for\n\nStochastic Variational Inference\n\nStephan Mandt\n\nDepartment of Physics\nPrinceton University\n\nsmandt@princeton.edu\n\nDavid Blei\n\ndavid.blei@columbia.edu\n\nDepartment of Computer Science\n\nDepartment of Statistics\n\nColumbia University\n\nAbstract\n\nStochastic variational inference (SVI) lets us scale up Bayesian computation to\nmassive data. It uses stochastic optimization to \ufb01t a variational distribution, fol-\nlowing easy-to-compute noisy natural gradients. As with most traditional stochas-\ntic optimization methods, SVI takes precautions to use unbiased stochastic gradi-\nents whose expectations are equal to the true gradients. 
In this paper, we explore\nthe idea of following biased stochastic gradients in SVI. Our method replaces\nthe natural gradient with a similarly constructed vector that uses a \ufb01xed-window\nmoving average of some of its previous terms. We will demonstrate the many ad-\nvantages of this technique. First, its computational cost is the same as for SVI and\nstorage requirements only multiply by a constant factor. Second, it enjoys signif-\nicant variance reduction over the unbiased estimates, smaller bias than averaged\ngradients, and leads to smaller mean-squared error against the full gradient. We\ntest our method on latent Dirichlet allocation with three large corpora.\n\n1\n\nIntroduction\n\nStochastic variational inference (SVI) lets us scale up Bayesian computation to massive data [1]. SVI\nhas been applied to many types of models, including topic models [1], probabilistic factorization [2],\nstatistical network analysis [3, 4], and Gaussian processes [5].\nSVI uses stochastic optimization [6] to \ufb01t a variational distribution, following easy-to-compute noisy\nnatural gradients that come from repeatedly subsampling from the large data set. As with most\ntraditional stochastic optimization methods, SVI takes precautions to use unbiased, noisy gradients\nwhose expectations are equal to the true gradients. This is necessary for the conditions of [6] to\napply, and guarantees that SVI climbs to a local optimum of the variational objective. Innovations\non SVI, such as subsampling from data non-uniformly [2] or using control variates [7, 8], have\nmaintained the unbiasedness of the noisy gradient.\nIn this paper, we explore the idea of following a biased stochastic gradient in SVI. We are inspired\nby the recent work in stochastic optimization that uses biased gradients. 
For example, stochastic averaged gradients (SAG) iteratively updates only a subset of terms in the full gradient [9]; averaged gradients (AG) follows the average of the sequence of stochastic gradients [10]. These methods lead to faster convergence on many problems.\nHowever, SAG and AG are not immediately applicable to SVI. First, SAG requires storing all of the terms of the gradient. In most applications of SVI there is a term for each data point, and avoiding such storage is one of the motivations for using the algorithm. Second, the SVI update has a form where we update the variational parameter with a convex combination of the previous parameter and a new noisy version of it. This property falls out of the special structure of the gradient of the variational objective, and has the significant advantage of keeping the parameter in its feasible space. (E.g., the parameter may be constrained to be positive or even on the simplex.) Averaged gradients, as we show below, do not enjoy this property. Thus, we develop a new method to form biased gradients in SVI.\nTo understand our method, we must briefly explain the special structure of the SVI stochastic natural gradient. At any iteration of SVI, we have a current estimate of the variational parameter \u03bbi, i.e., the parameter governing an approximate posterior that we are trying to estimate. First, we sample a data point wi. Then, we use the current estimate of variational parameters to compute expected sufficient statistics \u02c6Si about that data point. (The sufficient statistics \u02c6Si are a vector of the same dimension as \u03bbi.) Finally, we form the stochastic natural gradient of the variational objective L with this simple expression:\n\n\nabla_\lambda \mathcal{L} = \eta + N \hat{S}_i - \lambda_i, \quad (1)\n\nwhere \u03b7 is a prior from the model and N is an appropriate scaling. 
This is an unbiased noisy gradient [11, 1], and we follow it with a step size \u03c1i that decreases across iterations [6]. Because of its algebraic structure, each step amounts to taking a weighted average,\n\n\lambda_{i+1} = (1 - \rho_i)\lambda_i + \rho_i(\eta + N \hat{S}_i). \quad (2)\n\nNote that this keeps \u03bbi in its feasible set.\nWith these details in mind, we can now describe our method. Our method replaces the natural gradient in Eq. (1) with a similarly constructed vector that uses a fixed-window moving average of the previous sufficient statistics. That is, we replace the sufficient statistics with an appropriately scaled sum, \sum_{j=0}^{L-1} \hat{S}_{i-j}. Note this is different from averaging the gradients, which also involves the current iteration's estimate.\nWe will demonstrate the many advantages of this technique. First, its computational cost is the same as for SVI and storage requirements only multiply by a constant factor (the window length L). Second, it enjoys significant variance reduction over the unbiased estimates, smaller bias than averaged gradients, and leads to smaller mean-squared error against the full gradient. Finally, we tested our method on latent Dirichlet allocation with three large corpora. We found that it leads to faster convergence.\n\nRelated work We first discuss the related work from the SVI literature. Both Ref. [8] and Ref. [7] introduce control variates to reduce the gradient's variance. The method leads to unbiased gradient estimates. On the other hand, every few hundred iterations, an entire pass through the data set is necessary, which makes the performance and expenses of the method depend on the size of the data set. Ref. [12] develops a method to pre-select documents according to their influence on the global update. For large data sets, however, it also suffers from high storage requirements. 
In the stochastic optimization literature, we have already discussed SAG [9] and AG [10]. Similarly, Ref. [13] introduces an exponentially fading momentum term. It too suffers from the issues of SAG and AG, mentioned above.\n\n2 Smoothed stochastic gradients for SVI\n\nLatent Dirichlet Allocation and Variational Inference We start by reviewing stochastic variational inference for LDA [1, 14], a topic model that will be our running example. We are given a corpus of D documents with words w_{1:D,1:N}. We want to infer K hidden topics, defined as multinomial distributions over a vocabulary of size V. We define a multinomial parameter \u03b2_{1:V,1:K}, termed the topics. Each document d is associated with a normalized vector of topic weights \u0398d. Furthermore, each word n in document d has a topic assignment zdn. This is a K-vector of binary entries, such that z^k_{dn} = 1 if word n in document d is assigned to topic k, and z^k_{dn} = 0 otherwise.\nIn the generative process, we first draw the topics from a Dirichlet, \u03b2k \u223c Dirichlet(\u03b7). For each document, we draw the topic weights, \u0398d \u223c Dirichlet(\u03b1). Finally, for each word in the document, we draw an assignment zdn \u223c Multinomial(\u0398d), and we draw the word from the assigned topic, wdn \u223c Multinomial(\u03b2_{zdn}). The model has the following joint probability distribution:\n\np(w, \beta, \Theta, z | \eta, \alpha) = \prod_{k=1}^{K} p(\beta_k | \eta) \prod_{d=1}^{D} p(\Theta_d | \alpha) \prod_{n=1}^{N} p(z_{dn} | \Theta_d) \, p(w_{dn} | \beta_{1:K}, z_{dn}). \quad (3)\n\nFollowing [1], the topics \u03b2 are global parameters, shared among all documents. 
The assignments z and topic proportions \u0398 are local, as they characterize a single document.\nIn variational inference [15], we approximate the posterior distribution,\n\np(\beta, \Theta, z | w) = \frac{p(\beta, \Theta, z, w)}{\int d\beta \, d\Theta \sum_z p(\beta, \Theta, z, w)}, \quad (4)\n\nwhich is intractable to compute. The posterior is approximated by a factorized distribution,\n\nq(\beta, \Theta, z) = q(\beta | \lambda) \left( \prod_{d=1}^{D} q(\Theta_d | \gamma_d) \right) \left( \prod_{d=1}^{D} \prod_{n=1}^{N} q(z_{dn} | \phi_{dn}) \right). \quad (5)\n\nHere, q(\u03b2|\u03bb) and q(\u0398d|\u03b3d) are Dirichlet distributions, and q(zdn|\u03c6dn) are multinomials. The parameters \u03bb, \u03b3 and \u03c6 minimize the Kullback-Leibler (KL) divergence between the variational distribution and the posterior [16]. As shown in Refs. [1, 17], the objective to maximize is the evidence lower bound (ELBO),\n\nL(q) = E_q[\log p(w, \beta, \Theta, z)] - E_q[\log q(\beta, \Theta, z)]. \quad (6)\n\nThis is a lower bound on the marginal probability of the observations. It is a sensible objective function because, up to a constant, it is equal to the negative KL divergence between q and the posterior. Thus optimizing the ELBO with respect to q is equivalent to minimizing its KL divergence to the posterior.\nIn traditional variational methods, we iteratively update the local and global parameters. The local parameters are updated as described in [1, 17]. They are a function of the global parameters, so at iteration i the local parameter is \u03c6dn(\u03bbi). We are interested in the global parameters. They are updated based on the (expected) sufficient statistics S(\u03bbi),\n\nS(\lambda_i) = \sum_{d \in \{1,...,D\}} \sum_{n=1}^{N} \phi_{dn}(\lambda_i) \cdot W^T_{dn}, \qquad \lambda_{i+1} = \eta + S(\lambda_i). \quad (7)\n\nFor fixed d and n, the multinomial parameter \u03c6dn is K\u00d71. 
The binary vector Wdn is V\u00d71; it satisfies W^v_{dn} = 1 if the word n in document d is v, and else contains only zeros. Hence, S is K\u00d7V and therefore has the same dimension as \u03bb. Alternating updates lead to convergence.\n\nStochastic variational inference for LDA The computation of the sufficient statistics is inefficient because it involves a pass through the entire data set. In Stochastic Variational Inference for LDA [1, 14], it is approximated by stochastically sampling a \u201cminibatch\u201d Bi \u2282 {1, ..., D} of |Bi| documents, estimating S on the basis of the minibatch, and scaling the result appropriately,\n\n\hat{S}(\lambda_i, B_i) = \frac{D}{|B_i|} \sum_{d \in B_i} \sum_{n=1}^{N} \phi_{dn}(\lambda_i) \cdot W^T_{dn}.\n\nBecause it depends on the minibatch, \u02c6Si = \u02c6S(\u03bbi, Bi) is now a random variable. We will denote variables that explicitly depend on the random minibatch Bi at the current time i by circumflexes, such as \u02c6g and \u02c6S.\nIn SVI, we update \u03bb by admixing the random estimate of the sufficient statistics to the current value of \u03bb. This involves a learning rate \u03c1i < 1,\n\n\lambda_{i+1} = (1 - \rho_i)\lambda_i + \rho_i(\eta + \hat{S}(\lambda_i, B_i)). \quad (8)\n\nThe case of \u03c1 = 1 and |Bi| = D corresponds to batch variational inference (when sampling without replacement). For arbitrary \u03c1, this update is just stochastic gradient ascent, as a stochastic estimate of the natural gradient of the ELBO [1] is\n\n\hat{g}(\lambda_i, B_i) = (\eta - \lambda_i) + \hat{S}(\lambda_i, B_i). \quad (9)\n\nThis interpretation opens the world of gradient smoothing techniques. Note that the above stochastic gradient is unbiased: its expectation value is the full gradient. However, it has a variance. 
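The two properties just stated, that the minibatch statistic is unbiased for the full-data statistic, and that the update in Eq. (8) keeps \u03bb positive because it is a convex combination of positive quantities, can be checked in a few lines of numpy. The dimensions, prior, and random per-document "statistics" below are hypothetical stand-ins for the LDA quantities, not the paper's experiment:

```python
import numpy as np

# Toy stand-ins for the LDA quantities (K topics, V words, D documents).
rng = np.random.default_rng(0)
K, V, D = 3, 5, 50
eta = 0.5                        # Dirichlet prior
s_d = rng.random((D, K, V))      # per-document terms, standing in for sum_n phi_dn W_dn^T
S_full = s_d.sum(axis=0)         # full-data sufficient statistics

def S_hat(batch):
    """Minibatch estimate, scaled by D/|B| as in the text."""
    return (D / len(batch)) * s_d[batch].sum(axis=0)

def svi_update(lam, batch, rho):
    """Eq. (8): a convex combination of lam and (eta + S_hat)."""
    return (1.0 - rho) * lam + rho * (eta + S_hat(batch))

# Unbiasedness: averaging the estimator over many minibatches recovers S_full.
est = np.mean([S_hat(rng.choice(D, size=10, replace=False))
               for _ in range(20000)], axis=0)
assert np.allclose(est, S_full, rtol=0.05)

# Feasibility: lambda stays positive under the update, for any minibatch.
lam = rng.uniform(0.5, 1.5, size=(K, V))
lam = svi_update(lam, rng.choice(D, size=10, replace=False), rho=0.1)
assert np.all(lam > 0)
```

The second assert holds for any step size 0 < \u03c1 < 1, since both \u03bb and \u03b7 + \u02c6S are entrywise positive.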
The goal of this paper will be to reduce this variance at the expense of introducing a bias.\n\nAlgorithm 1: Smoothed stochastic gradients for Latent Dirichlet Allocation\nInput: D documents, minibatch size B, number of stored sufficient statistics L, learning rate \u03c1i, hyperparameters \u03b1, \u03b7.\nOutput: Hidden variational parameters \u03bb, \u03c6, \u03b3.\n1: Initialize \u03bb randomly and \u02c6g^L = 0.\n2: Initialize empty queue Q = {}.\n3: for i = 0 to \u221e do\n4: Sample minibatch Bi \u2282 {1, . . . , D} uniformly.\n5: Initialize \u03b3.\n6: repeat\n7: For d \u2208 Bi and n \u2208 {1, . . . , N} set\n8: \u03c6^k_dn \u221d exp(E[log \u0398dk] + E[log \u03b2k,wdn]), k \u2208 {1, . . . , K}\n9: \u03b3d = \u03b1 + \sum_n \u03c6dn\n10: until \u03c6dn and \u03b3d converge.\n11: For each topic k, calculate sufficient statistics for minibatch Bi: \u02c6Si = (D/|Bi|) \sum_{d \u2208 Bi} \sum_{n=1}^{N} \u03c6dn W^T_dn\n12: Add the new sufficient statistic in front of queue Q: Q \u2190 {\u02c6Si} + Q\n13: if length(Q) > L then remove the last element: Q \u2190 Q \u2212 {\u02c6S_{i\u2212L}}\n14: Update \u03bb, using the stored sufficient statistics:\n15: \u02c6S^L_i \u2190 \u02c6S^L_{i\u22121} + (\u02c6Si \u2212 \u02c6S_{i\u2212L})/L\n16: \u02c6g^L_i \u2190 (\u03b7 \u2212 \u03bbi) + \u02c6S^L_i\n17: \u03bb_{i+1} = \u03bbi + \u03c1i \u02c6g^L_i\n18: end\n\nSmoothed stochastic gradients for SVI Noisy stochastic gradients can slow down the convergence of SVI or lead to convergence to bad local optima. Hence, we propose a smoothing scheme to reduce the variance of the noisy natural gradient. To this end, we average the sufficient statistics over the past L iterations. Here is a sketch:\n\n1. Uniformly sample a minibatch Bi \u2282 {1, . . . , D} of documents. Compute the local variational parameters \u03c6 from a given \u03bbi.\n\n2. 
Compute the sufficient statistics \u02c6Si = \u02c6S(\u03c6(\u03bbi), Bi).\n\n3. Store \u02c6Si, along with the L most recent sufficient statistics. Compute their mean, \hat{S}^L_i = \frac{1}{L} \sum_{j=0}^{L-1} \hat{S}_{i-j}.\n\n4. Compute the smoothed stochastic gradient according to\n\n\hat{g}^L_i = (\eta - \lambda_i) + \hat{S}^L_i. \quad (10)\n\n5. Use the smoothed stochastic gradient to calculate \u03bbi+1. Repeat.\n\nDetails are in Algorithm 1. We now explore its properties. First, note that smoothing the sufficient statistics comes at almost no extra computational costs. In fact, the mean of the stored sufficient statistics does not explicitly have to be computed, but rather amounts to the update\n\n\hat{S}^L_i \leftarrow \hat{S}^L_{i-1} + (\hat{S}_i - \hat{S}_{i-L})/L, \quad (11)\n\nafter which \u02c6S_{i\u2212L} is deleted. Storing the sufficient statistics can be expensive for large values of L: in the context of LDA involving the typical parameters K = 10^2 and V = 10^4, using L = 10^2 amounts to storing 10^8 64-bit floats, which is in the Gigabyte range.\nNote that when L = 1 we obtain stochastic variational inference (SVI) in its basic form. This includes deterministic variational inference for L = 1, B = D in the case of sampling without replacement within the minibatch.\n\nBiased gradients Let us now investigate the algorithm theoretically. Note that the only noisy part in the stochastic gradient in Eq. (9) is the sufficient statistics. Averaging over L stochastic sufficient statistics thus promises to reduce the noise in the gradient. We are interested in the effect of the additional parameter L.\nWhen we average over the L most recent sufficient statistics, we introduce a bias. As the variational parameters change during each iteration, the averaged sufficient statistics deviate in expectation from their current value. This induces biased gradients. 
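The queue-plus-running-mean bookkeeping of Algorithm 1 and Eq. (11) can be written compactly. The class below is an illustrative sketch (the names are ours, not from the paper); it updates the window mean in O(1) per iteration instead of re-summing L stored arrays, treating missing early terms as zero, as Algorithm 1 does:

```python
from collections import deque

import numpy as np

class SmoothedStatistics:
    """Fixed-window average of the last L sufficient statistics.

    Implements the O(1) update of Eq. (11),
        S^L_i <- S^L_{i-1} + (S_i - S_{i-L}) / L,
    with missing early terms treated as zero, as in Algorithm 1.
    """

    def __init__(self, L, shape):
        self.L = L
        self.queue = deque()          # the L most recent statistics
        self.mean = np.zeros(shape)

    def update(self, S_new):
        self.queue.appendleft(S_new)
        # Pop S_{i-L} once the window is full; before that it counts as zero.
        S_old = self.queue.pop() if len(self.queue) > self.L else np.zeros_like(S_new)
        self.mean = self.mean + (S_new - S_old) / self.L
        return self.mean

# Once the queue is full, the running mean equals the plain window average.
sm = SmoothedStatistics(L=3, shape=(2,))
for t in range(6):
    mean = sm.update(np.array([float(t), 1.0]))
assert np.allclose(mean, [(3 + 4 + 5) / 3, 1.0])
```

Only the queue (L arrays of size K\u00d7V) and one running mean are stored, which is the constant-factor storage overhead mentioned above.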
In a nutshell, large values of L will reduce the variance but increase the bias.\nTo better understand this tradeoff, we need to introduce some notation. We defined the stochastic gradient \u02c6gi = \u02c6g(\u03bbi, Bi) in Eq. (9) and refer to gi = E_{Bi}[\u02c6g(\u03bbi, Bi)] as the full gradient (FG). We also defined the smoothed stochastic gradient \u02c6g^L_i in Eq. (10). Now, we need to introduce an auxiliary variable, g^L_i := (\eta - \lambda_i) + \frac{1}{L} \sum_{j=0}^{L-1} S_{i-j}. This is the time-averaged full gradient. It involves the full sufficient statistics Si = S(\u03bbi) evaluated along the sequence \u03bb1, \u03bb2, ... generated by our algorithm.\nWe can expand the smoothed stochastic gradient into three terms:\n\n\hat{g}^L_i = \underbrace{g_i}_{FG} + \underbrace{(\hat{g}^L_i - g^L_i)}_{noise} + \underbrace{(g^L_i - g_i)}_{bias}. \quad (12)\n\nThis involves the full gradient (FG), a bias term and a stochastic noise term. We want to minimize the statistical error between the full gradient and the smoothed gradient by an optimal choice of L. We will show that this optimal choice is determined by a tradeoff between variance and bias.\nFor the following analysis, we need to compute expectation values with respect to realizations of our algorithm, which is a stochastic process that generates a sequence of \u03bbi's. Those expectation values are denoted by E[\u00b7]. Notably, not only the minibatches Bi are random variables under this expectation, but also the entire sequences \u03bb1, \u03bb2, ... . Therefore, one needs to keep in mind that even the full gradients gi = g(\u03bbi) are random variables and can be studied under this expectation.\nWe find that the mean squared error of the smoothed stochastic gradient dominantly decomposes into a mean squared bias and a noise term:\n\nE[(\hat{g}^L_i - g_i)^2] \approx \underbrace{E[(g^L_i - g_i)^2]}_{mean \ squared \ bias} + \underbrace{E[(\hat{g}^L_i - g^L_i)^2]}_{variance}. \quad (13)\n\nTo see this, consider the mean squared error of the smoothed stochastic gradient with respect to the full gradient, E[(\hat{g}^L_i - g_i)^2], adding and subtracting g^L_i:\n\nE[(\hat{g}^L_i - g_i)^2] = E[(\hat{g}^L_i - g^L_i)^2] + 2 E[(\hat{g}^L_i - g^L_i)(g^L_i - g_i)] + E[(g^L_i - g_i)^2].\n\nWe encounter a cross-term, which we argue to be negligible. In defining \u2206\u02c6Si = (\u02c6Si \u2212 Si), we find that\n\n(\hat{g}^L_i - g^L_i)(g^L_i - g_i) = \frac{1}{L} \sum_{j=0}^{L-1} \Delta\hat{S}_{i-j} (g^L_i - g_i).\n\nThe fluctuations of the sufficient statistics \u2206\u02c6Si are a random variable with mean zero, and the randomness of (g^L_i \u2212 gi) enters only via \u03bbi. One can assume a very small statistical correlation between those two terms, E[\u2206\u02c6S_{i\u2212j}(g^L_i \u2212 gi)] \approx E[\u2206\u02c6S_{i\u2212j}] E[(g^L_i \u2212 gi)] = 0. Therefore, the cross-term can be expected to be negligible. We confirmed this fact empirically in our numerical experiments: the top row of Fig. 1 shows that the sum of squared bias and variance is barely distinguishable from the squared error.\nBy construction, all bias comes from the sufficient statistics:\n\nE[(g^L_i - g_i)^2] = E\left[ \left( \frac{1}{L} \sum_{j=0}^{L-1} (S_{i-j} - S_i) \right)^2 \right]. \quad (14)\n\nAt this point, little can be said in general about the bias term, apart from the fact that it should shrink with the learning rate. We will explore it empirically in the next section. We now consider the variance term:\n\nE[(\hat{g}^L_i - g^L_i)^2] = E\left[ \left( \frac{1}{L} \sum_{j=0}^{L-1} \Delta\hat{S}_{i-j} \right)^2 \right] = \frac{1}{L^2} \sum_{j=0}^{L-1} E[(\Delta\hat{S}_{i-j})^2] = \frac{1}{L^2} \sum_{j=0}^{L-1} E[(\hat{g}_{i-j} - g_{i-j})^2].\n\nFigure 1: Empirical test of the variance-bias tradeoff on 2,000 abstracts from the Arxiv repository (\u03c1 = 0.01, B = 300). Top row. For fixed L = 30 (left), L = 100 (middle), and L = 300 (right), we compare the squared bias, variance, variance+bias and the squared error as a function of iterations. Depending on L, the variance or the bias give the dominant contribution to the error. Bottom row. Squared bias (left), variance (middle) and squared error (right) for different values of L. Intermediate values of L lead to the smallest squared error and hence to the best tradeoff between small variance and small bias.\n\nThis can be reformulated as var(\hat{g}^L_i) = \frac{1}{L^2} \sum_{j=0}^{L-1} var(\hat{g}_{i-j}). Assuming that the variance changes little during those L successive updates, we can approximate var(\hat{g}_{i-j}) \approx var(\hat{g}_i), which yields\n\nvar(\hat{g}^L_i) \approx \frac{1}{L} var(\hat{g}_i). \quad (15)\n\nThe smoothed gradient has therefore a variance that is approximately L times smaller than the variance of the original stochastic gradient.\n\nBias-variance tradeoff To understand and illustrate the effect of L in our optimization problem, we used a small data set of 2000 abstracts from the Arxiv repository. This allowed us to compute the full sufficient statistics and the full gradient for reference. More details on the data set and the corresponding parameters will be given below.\nWe computed squared bias (SB), variance (VAR) and squared error (SE) according to Eq. (13) for a single stochastic optimization run. More explicitly,\n\nSB_i = \sum_{k=1}^{K} \sum_{v=1}^{V} (g^L_i - g_i)^2_{kv}, \quad VAR_i = \sum_{k=1}^{K} \sum_{v=1}^{V} (\hat{g}^L_i - g^L_i)^2_{kv}, \quad SE_i = \sum_{k=1}^{K} \sum_{v=1}^{V} (\hat{g}^L_i - g_i)^2_{kv}. \quad (16)\n\nIn Fig. 1, we plot those quantities as a function of iteration steps (time). As argued before, we arrive at a drastic variance reduction (bottom, middle) when choosing large values of L.\nIn contrast, the squared bias (bottom, left) typically increases with L. The bias shows a complex time-evolution as it maintains memory of L previous steps. For example, the kinks in the bias curves (bottom, left) occur at iterations 3, 10, 30, 100 and 300, i.e. they correspond to the values of L. Those are the times from which on the smoothed gradient loses memory of its initial state, typically carrying a large bias. The variances become approximately stationary at iteration L (bottom, middle). Those are the times where the initialization process ends and the queue Q in Algorithm 1 has reached its maximal length L. 
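Eq. (15) is easy to verify numerically in isolation. The toy check below is our construction, not the paper's experiment: it holds the target fixed, so there is no bias, and only measures how averaging a window of L i.i.d. noisy gradients scales the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 10
# 100,000 independent windows of L noisy "gradients" with unit variance.
draws = rng.normal(loc=1.0, scale=1.0, size=(100_000, L))

var_single = draws[:, 0].var()            # variance of one stochastic gradient
var_smoothed = draws.mean(axis=1).var()   # variance of the window average

# Eq. (15): the smoothed gradient has roughly 1/L the variance.
assert abs(var_single - 1.0) < 0.05
assert abs(var_smoothed - 1.0 / L) < 0.01
```

In the algorithm the window entries are correlated through \u03bb, so the 1/L scaling is the approximation stated in the text, not an exact identity.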
The squared error (bottom, right) is to a good approximation just the sum of squared bias and variance. This is also shown in the top panel of Fig. 1.\n\nDue to the long-time memory of the smoothed gradients, one can associate some \u201cinertia\u201d or \u201cmomentum\u201d to each value of L. The larger L, the smaller the variance and the larger the inertia. In a non-convex optimization setup with many local optima as in our case, too much inertia can be harmful. This effect can be seen for the L = 100 and L = 300 runs in Fig. 1 (bottom), where the mean squared bias and error curves bend upwards at long times. Think of a marble rolling in a wavy landscape: with too much momentum it runs the danger of passing through a good optimum and eventually getting trapped in a bad local optimum. This picture suggests that the optimal value of L depends on the \u201cruggedness\u201d of the potential landscape of the optimization problem at hand. Our empirical study suggests that choosing L between 10 and 100 produces the smallest mean squared error.\n\nAside: connection to gradient averaging Our algorithm was inspired by various gradient averaging schemes. However, we cannot easily use averaged gradients in SVI. To see the drawbacks of gradient averaging, let us consider L stochastic gradients \u02c6gi, \u02c6g_{i\u22121}, \u02c6g_{i\u22122}, ..., \u02c6g_{i\u2212L+1} and replace\n\n\hat{g}_i \longrightarrow \frac{1}{L} \sum_{j=0}^{L-1} \hat{g}_{i-j}. \quad (17)\n\nOne arrives at the following parameter update for \u03bbi:\n\n\lambda_{i+1} = (1 - \rho_i)\lambda_i + \rho_i \left( \eta + \frac{1}{L} \sum_{j=0}^{L-1} \hat{S}_{i-j} - \frac{1}{L} \sum_{j=0}^{L-1} (\lambda_{i-j} - \lambda_i) \right). \quad (18)\n\nThis update can lead to the violation of optimization constraints, namely to a negative variational parameter \u03bb. 
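A numerical caricature of this failure mode (the numbers are ours, chosen to make the effect obvious): after a rapid drop in \u03bb, the extra term in Eq. (18) dominates and pushes the averaged-gradient update negative, whereas the smoothed update remains a convex combination of positive quantities.

```python
import numpy as np

eta, rho, L = 0.1, 0.5, 3
lam_hist = np.array([100.0, 100.0, 0.1])  # lambda_{i-2}, lambda_{i-1}, lambda_i: a fast drop
S_hist = np.zeros(L)                       # statistics are nonnegative; zero is the worst case
lam_i = lam_hist[-1]

S_bar = S_hist.mean()
lag = (lam_hist - lam_i).mean()            # the extra (1/L) sum_j (lambda_{i-j} - lambda_i) term

lam_averaged = (1 - rho) * lam_i + rho * (eta + S_bar - lag)  # Eq. (18)
lam_smoothed = (1 - rho) * lam_i + rho * (eta + S_bar)        # smoothed-gradient update

assert lam_averaged < 0   # the averaged-gradient update violates positivity
assert lam_smoothed > 0   # the smoothed update stays feasible
```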
Note that for L = 1 (the case of SVI), the third term is zero, guaranteeing positivity of the update. This is no longer guaranteed for L > 1, and the gradient updates will eventually become negative. We found this in practice. Furthermore, we find that there is an extra contribution to the bias compared to Eq. (14),\n\nE[(g^L_i - g_i)^2] = E\left[ \left( \frac{1}{L} \sum_{j=0}^{L-1} (\lambda_i - \lambda_{i-j}) + \frac{1}{L} \sum_{j=0}^{L-1} (S_{i-j} - S_i) \right)^2 \right]. \quad (19)\n\nHence, the averaged gradient carries an additional bias in \u03bb; it is the same term that may violate optimization constraints. In contrast, the variance of the averaged gradient is the same as the variance of the smoothed gradient. Compared to gradient averaging, the smoothed gradient has a smaller bias while profiting from the same variance reduction.\n\n3 Empirical study\n\nWe tested SVI for LDA, using the smoothed stochastic gradients, on three large corpora:\n\n\u2022 882K scientific abstracts from the Arxiv repository, using a vocabulary of 14K words.\n\u2022 1.7M articles from the New York Times, using a vocabulary of 8K words.\n\u2022 3.6M articles from Wikipedia, using a vocabulary of 7.7K words.\n\nWe set the minibatch size to B = 300 and furthermore set the number of topics to K = 100, and the hyper-parameters \u03b1 = \u03b7 = 0.5. We fixed the learning rate to \u03c1 = 10^{-3}. We also compared our results to a decreasing learning rate and found the same behavior.\nFor a quantitative test of model fitness, we evaluate the predictive probability over the vocabulary [1]. To this end, we separate a test set from the training set. This test set is furthermore split into two parts: half of it is used to obtain the local variational parameters (i.e. the topic proportions, by fitting LDA with the fixed global parameters \u03bb). 
The second part is used to compute the likelihoods of the contained words:\n\np(w_{new} | w_{old}, D) \approx \int \left( \sum_{k=1}^{K} \Theta_k \beta_{k, w_{new}} \right) q(\Theta) q(\beta) \, d\Theta \, d\beta = \sum_{k=1}^{K} E_q[\Theta_k] E_q[\beta_{k, w_{new}}]. \quad (20)\n\nWe show the predictive probabilities as a function of effective passes through the data set in Fig. 2 for the New York Times, Arxiv, and Wikipedia corpus, respectively. Effective passes through the data set are defined as (minibatch size * iterations / size of corpus). Within each plot, we compare different numbers of stored sufficient statistics, L \u2208 {1, 10, 100, 1000, 10000, \u221e}. The last value of L = \u221e corresponds to a version of the algorithm where we average over all previous sufficient statistics, which is related to averaged gradients (AG), but which has a bias too large to compete with small and finite values of L. The maximal values of 30, 5 and 6 effective passes through the Arxiv, New York Times and Wikipedia data sets, respectively, approximately correspond to a run time of 24 hours, which we set as a hard cutoff in our study.\n\nFigure 2: Per-word predictive probability as a function of the effective number of passes through the data (minibatch size * iterations / size of corpus). We compare results for the New York Times, Arxiv, and Wikipedia data sets. Each plot shows data for different values of L. We used a constant learning rate of 10^{-3}, and set a time budget of 24 hours. Highest likelihoods are obtained for L between 10 and 100, after which strong bias effects set in.\n\nWe obtain the highest held-out likelihoods for intermediate values of L. E.g., averaging only over 10 subsequent sufficient statistics results in much faster convergence to higher likelihoods at very little extra storage costs. 
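The held-out score in Eq. (20) reduces to normalized Dirichlet parameters, since the mean of a Dirichlet is its parameter vector divided by its sum. A minimal sketch (the function name and toy values are ours):

```python
import numpy as np

def predictive_prob(word_id, gamma_d, lam):
    """Eq. (20): p(w_new | w_old) ~= sum_k E_q[Theta_k] * E_q[beta_{k, w_new}]."""
    E_theta = gamma_d / gamma_d.sum()                  # mean of q(Theta_d | gamma_d)
    E_beta = lam / lam.sum(axis=1, keepdims=True)      # row k: mean of q(beta_k | lambda_k)
    return float(E_theta @ E_beta[:, word_id])

gamma_d = np.array([2.0, 1.0, 1.0])  # topic proportions fit on the held-out document's first half
lam = np.ones((3, 4))                # uniform topics over a toy 4-word vocabulary
p = predictive_prob(0, gamma_d, lam) # every word has probability 1/4 in this degenerate case
assert abs(p - 0.25) < 1e-12
```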
As we discussed above, we attribute this fact to the best tradeoff between variance and bias.\n\n4 Discussion and Conclusions\n\nSVI scales up Bayesian inference, but suffers from noisy stochastic gradients. To reduce the mean squared error relative to the full gradient, we averaged the sufficient statistics of SVI successively over L iteration steps. The resulting smoothed gradient is biased, however, and the performance of the method is governed by the competition between bias and variance. We argued theoretically and showed empirically that for a fixed time budget, intermediate values of the number of stored sufficient statistics L give the highest held-out likelihoods.\nProving convergence for our algorithm is still an open problem, which is non-trivial especially because the variational objective is non-convex. To guarantee convergence, however, we can simply phase out our algorithm and reduce the number of stored gradients to one as we get close to convergence. At this point, we recover SVI.\n\nAcknowledgements We thank Laurent Charlin, Alp Kucukelbir, Prem Gopalan, Rajesh Ranganath, Linpeng Tang, Neil Houlsby, Marius Kloft, and Matthew Hoffman for discussions. We acknowledge financial support by NSF CAREER NSF IIS-0745520, NSF BIGDATA NSF IIS-1247664, NSF NEURO NSF IIS-1009542, ONR N00014-11-1-0651, the Alfred P. Sloan foundation, DARPA FA8750-14-2-0009 and the NSF MRSEC program through the Princeton Center for Complex Materials Fellowship (DMR-0819860).\n\nReferences\n[1] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303\u20131347, 2013.\n\n[2] Prem Gopalan, Jake M Hofman, and David M Blei. Scalable recommendation with Poisson factorization. Preprint, arXiv:1311.1704, 2013.\n\n[3] Prem K Gopalan and David M Blei. Efficient discovery of overlapping communities in massive networks. 
Proceedings of the National Academy of Sciences, 110(36):14534\u201314539, 2013.\n\n[4] Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed membership\nstochastic blockmodels. In Advances in Neural Information Processing Systems, pages 33\u201340,\n2009.\n\n[5] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. Uncer-\n\ntainty in Arti\ufb01cial Intelligence, 2013.\n\n[6] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Math-\n\nematical Statistics, pages 400\u2013407, 1951.\n\n[7] Chong Wang, Xi Chen, Alex Smola, and Eric Xing. Variance reduction for stochastic gradient\noptimization. In Advances in Neural Information Processing Systems, pages 181\u2013189, 2013.\n[8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive vari-\nance reduction. In Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n[9] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing \ufb01nite sums with the stochastic\n\naverage gradient. Technical report, HAL 00860051, 2013.\n\n[10] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Pro-\n\ngramming, 120(1):221\u2013259, 2009.\n\n[11] Masa-Aki Sato. Online model selection based on the variational Bayes. Neural Computation,\n\n13(7):1649\u20131681, 2001.\n\n[12] Mirwaes Wahabzada and Kristian Kersting. Larger residuals, less work: Active document\nscheduling for latent Dirichlet allocation. In Machine Learning and Knowledge Discovery in\nDatabases, pages 475\u2013490. Springer, 2011.\n\n[13] Paul Tseng. An incremental gradient (-projection) method with momentum term and adaptive\n\nstepsize rule. SIAM Journal on Optimization, 8(2):506\u2013531, 1998.\n\n[14] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet\n\nallocation. 
In Advances in Neural Information Processing Systems, pages 856\u2013864, 2010.\n\n[15] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and vari-\n\national inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n\n[16] Christopher M Bishop et al. Pattern Recognition and Machine Learning, volume 1. Springer\n\nNew York, 2006.\n\n[17] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. The Journal\n\nof Machine Learning Research, 3:993\u20131022, 2003.\n\n9\n\n\f", "award": [], "sourceid": 1269, "authors": [{"given_name": "Stephan", "family_name": "Mandt", "institution": "Princeton University"}, {"given_name": "David", "family_name": "Blei", "institution": "Columbia University"}]}