{"title": "Stochastic Expectation Maximization with Variance Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 7967, "page_last": 7977, "abstract": "Expectation-Maximization (EM) is a popular tool for learning latent variable models, but the vanilla batch EM does not scale to large data sets because the whole data set is needed at every E-step. Stochastic Expectation Maximization (sEM) reduces the cost of the E-step by stochastic approximation. However, sEM has a slower asymptotic convergence rate than batch EM, and requires a decreasing sequence of step sizes, which is difficult to tune. In this paper, we propose a variance reduced stochastic EM (sEM-vr) algorithm inspired by variance reduced stochastic gradient descent algorithms. We show that sEM-vr has the same exponential asymptotic convergence rate as batch EM. Moreover, sEM-vr only requires a constant step size to achieve this rate, which alleviates the burden of parameter tuning. We compare sEM-vr with batch EM, sEM and other algorithms on Gaussian mixture models and probabilistic latent semantic analysis, and sEM-vr converges significantly faster than these baselines.", "full_text": "Stochastic Expectation Maximization with Variance Reduction

Jianfei Chen†, Jun Zhu†*, Yee Whye Teh‡ and Tong Zhang§

† Dept. of Comp. Sci. & Tech., BNRist Center, State Key Lab for Intell. Tech. & Sys., Institute for AI, THBI Lab, Tsinghua University, Beijing, 100084, China
‡ Department of Statistics, University of Oxford
§ Tencent AI Lab

{chenjian14@mails,dcszj@}.tsinghua.edu.cn
y.w.teh@stats.ox.ac.uk; tongzhang@tongzhang-ml.org

Abstract

Expectation-Maximization (EM) is a popular tool for learning latent variable models, but the vanilla batch EM does not scale to large data sets because the whole data set is needed at every E-step. 
Stochastic Expectation Maximization (sEM) reduces the cost of the E-step by stochastic approximation. However, sEM has a slower asymptotic convergence rate than batch EM, and requires a decreasing sequence of step sizes, which is difficult to tune. In this paper, we propose a variance reduced stochastic EM (sEM-vr) algorithm inspired by variance reduced stochastic gradient descent algorithms. We show that sEM-vr has the same exponential asymptotic convergence rate as batch EM. Moreover, sEM-vr only requires a constant step size to achieve this rate, which alleviates the burden of parameter tuning. We compare sEM-vr with batch EM, sEM and other algorithms on Gaussian mixture models and probabilistic latent semantic analysis, and sEM-vr converges significantly faster than these baselines.

1 Introduction

Latent variable models are an important class of models due to their wide applicability across machine learning and statistics. Examples include factor analysis in psychology and the understanding of human cognition [32], hidden Markov models for modelling sequences, e.g. speech and language [29] and DNA [15], document and topic models [17, 4], and mixture models for density estimation and clustering [26]. Expectation Maximization (EM) [12] is a basic tool for maximum likelihood estimation of the parameters in latent variable models. It is an iterative algorithm with two steps: an E-step, which calculates the expectation of sufficient statistics under the latent variable posteriors given the current parameters, and an M-step, which updates the parameters given the expectations. With the phenomenal growth of big data sets in recent years, the basic batch EM (bEM) algorithm of [12] is quickly becoming infeasible because the whole data set is needed at every E-step. 
Cappé and Moulines [6] proposed a stochastic EM (sEM) algorithm for exponential family models, which reduces the time complexity of the E-step by approximating the full-batch expectation with an exponential moving average over minibatches of data. sEM has been adopted in many applications including natural language processing [24], topic modeling [16, 14] and hidden Markov models [5]. However, sEM has a slow asymptotic convergence rate due to the high variance of each update. Unlike the original batch EM (bEM), which converges exponentially fast near a local optimum, the distance to a local optimum only decreases at the rate $O(1/\sqrt{T})$ for sEM, where T is the number of iterations. Moreover, sEM requires a decreasing sequence of step sizes to converge. The decay rate of the step sizes is often difficult to tune.

* Corresponding author.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Recently, there has been much progress in accelerating stochastic gradient descent (SGD) by reducing the variance of the stochastic gradients, including SAG, SAGA and SVRG [22, 20, 11]. These algorithms achieve better convergence rates by utilizing infrequently computed batch gradients as control variates. Such ideas have also been brought into gradient-based Bayesian learning algorithms, including stochastic variational inference [25], as well as stochastic gradient Markov chain Monte Carlo (SGMCMC) [13, 8, 7].

In this paper, we develop a variance reduced stochastic EM algorithm (sEM-vr). In each epoch, that is, a full pass through the data set, our algorithm computes the full batch expectation as a control variate, and uses this to reduce the variance of minibatch updates in that epoch. Let E be the number of epochs and M be the number of minibatch iterations per epoch. 
We show that near a local optimum, our algorithm, with a constant step size, enjoys a convergence rate of $O((M^{-1}\log M)^{E/2})$ towards the optimum. Like bEM, our convergence rate is exponential with respect to the number of epochs, and is asymptotically faster than sEM. We also show that our algorithm converges globally with a constant step size, under stronger assumptions. Note that leveraging variance reduction ideas in sEM is not straightforward, since sEM is not a stochastic gradient descent algorithm but rather a stochastic approximation [21] algorithm. In particular, the proof techniques we utilize are different from those for stochastic gradient descent algorithms. We demonstrate our algorithm on Gaussian mixture models and probabilistic latent semantic analysis [18]. sEM-vr achieves significantly faster convergence compared with sEM, bEM, and other gradient-based and Bayesian algorithms.

2 Background

We review batch and stochastic EM algorithms in this section. Throughout the paper we focus on exponential family models with tractable E- and M-steps, which stochastic EM [6] is designed for.

2.1 EM Algorithm

The EM algorithm is designed for models with some observed variable x and hidden variable h. We assume an exponential family joint distribution $p(x, h; \theta) = b(x, h)\exp\{\eta(\theta)^\top \phi(x, h) - A(\theta)\}$ parameterized by θ. Given a data set of N (≫ 1) observations $X = \{x_i\}_{i=1}^N$, we want to obtain a maximum likelihood estimate (MLE) of the parameter θ, by maximizing the log marginal likelihood $L(\theta) := \sum_{i=1}^N \log p(x_i; \theta) = \sum_{i=1}^N \log \int p(x_i, h_i; \theta)\, dh_i$, where the variables $(x_i, h_i)$ are i.i.d. given θ. 
Denote $H = \{h_i\}_{i=1}^N$. Batch expectation-maximization (bEM) [12] optimizes the log marginal likelihood L(θ) by constructing a lower bound of it:

$L(\theta) \ge Q(\theta; \hat\theta) - E_{p(H|X;\hat\theta)}\left[\log p(H|X, \hat\theta)\right]$,   (1)

$Q(\theta; \hat\theta) := E_{p(H|X;\hat\theta)}\left[\log p(X, H; \theta)\right] = N\left[\eta(\theta)^\top F(\hat\theta) - A(\theta)\right] + \text{constant}$,   (2)

where we define $F(\hat\theta) := \frac{1}{N}\sum_{i=1}^N f_i(\hat\theta)$ as the full-batch expected sufficient statistics, and where $f_i(\hat\theta) := E_{p(h_i|x_i;\hat\theta)}[\phi(x_i, h_i)]$ is the expected sufficient statistics conditioned on the observed datum $x_i$. Let $\hat\theta_e$ be the estimated parameter at iteration or epoch e, where each epoch is a complete pass through the data set. In the E-step, bEM tightens the bound in Eq. (1) by setting $\hat\theta = \hat\theta_e$, and computes the expected sufficient statistics $F(\hat\theta_e)$. In the M-step, bEM finds a maximizer $\hat\theta_{e+1}$ of the lower bound with respect to θ, by solving the optimization problem $\mathrm{argmax}_\theta\{\eta(\theta)^\top F(\hat\theta) - A(\theta)\}$. The solution is denoted as $R(F(\hat\theta))$, and is assumed to be tractable. In summary, the bEM updates can be written simply as

E-step: compute $F(\hat\theta_e)$;  M-step: let $\hat\theta_{e+1} = R(F(\hat\theta_e))$.   (3)

The algorithm is also applicable to maximum a posteriori (MAP) estimation of parameters, with a conjugate prior $p(\theta; \alpha) = \exp\{\eta(\theta)^\top \alpha - A(\theta)\}$ with hyperparameter α. Instead of L(θ), MAP maximizes $L(\theta) + \log p(\theta; \alpha) \ge N\eta(\theta)^\top\left(\alpha/N + F(\hat\theta)\right) - NA(\theta) + \text{constant}$, and we still apply Eq. 
(3), but with $f_i(\hat\theta) := \alpha/N + E_{p(h_i|x_i;\hat\theta)}[\phi(x_i, h_i)]$ instead.

2.2 Stochastic EM Algorithm

When the data set is large, that is, N is large, computing $F(\hat\theta_t)$ in the E-step is too expensive because it needs a full pass through the entire data set. Stochastic EM (sEM) [6] avoids this by maintaining an exponential moving average $\hat s_t$ as an approximation of the full average $F(\hat\theta_t)$. At iteration t, sEM picks a single random datum i, and updates:

E-step: $\hat s_{t+1} = (1 - \rho_t)\hat s_t + \rho_t f_i(\hat\theta_t)$;  M-step: $\hat\theta_{t+1} = R(\hat s_{t+1})$,

where $(\rho_t)$ is a sequence of step sizes satisfying $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$. We deliberately choose different iteration indices e and t for bEM and sEM to emphasize their different time complexity per iteration. In practice, sEM can take a minibatch of data instead of a single datum per iteration, but we stick to a single datum for cleaner presentation. The two sEM updates can be rolled into a single update

$\hat s_{t+1} = (1 - \rho_t)\hat s_t + \rho_t f_i(\hat s_t)$,   (4)

where for simplicity we have overloaded the notation with $f_i(s) := f_i(R(s))$. This first maps s, which can be interpreted as the estimated mean parameter of the model, into the parameters θ = R(s), before computing the required expected sufficient statistics $f_i(\theta)$ under the posterior given observation $x_i$. Which of the two definitions applies should be clear from the type of the argument, and we feel this helps reduce the notational burden on the reader. We similarly overload $F(s) := F(R(s))$ and $L(s) := L(R(s))$, so we can also write the bEM updates (Eq. 3) simply as $\hat s_{e+1} = F(\hat s_e)$. Intuitively, we want to find a stationary point s* of the bEM iterations, i.e., s* = F(s*). 
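For intuition, the two update styles (fixed-point bEM and Robbins-Monro sEM) can be sketched in a few lines of Python. The per-datum map f_i below is a stylized affine stand-in, not a real posterior expectation; it is chosen only so that the stationary point s* = F(s*) is known in closed form:

```python
import random

# Stylized stand-in for the per-datum E-step map: f_i(s) = 0.5*s + x_i, so
# F(s) = 0.5*s + mean(x) and the stationary point s* = F(s*) is 2*mean(x).
# (A real f_i would compute expected sufficient statistics under R(s).)
def f_i(s, x_i):
    return 0.5 * s + x_i

def bem_step(s, x):
    """One bEM epoch: the fixed-point update s <- F(s)."""
    return sum(f_i(s, xi) for xi in x) / len(x)

def sem(x, s0, iters, seed=0):
    """sEM as a Robbins-Monro scheme: s <- (1 - rho_t)*s + rho_t*f_i(s)."""
    rng = random.Random(seed)
    s = s0
    for t in range(iters):
        rho = 1.0 / (t + 2)          # decreasing step sizes, sum = inf, sum of squares < inf
        s = (1 - rho) * s + rho * f_i(s, x[rng.randrange(len(x))])
    return s

x = [0.1, 0.4, 0.3, 0.2]
s_star = 2 * sum(x) / len(x)         # the known fixed point s* = F(s*)

s = 0.0
for _ in range(50):                  # bEM contracts geometrically
    s = bem_step(s, x)
print(abs(s - s_star) < 1e-9)

print(abs(sem(x, 0.0, 20000) - s_star) < 0.05)   # sEM: slow, noisy convergence
```

Each bEM step costs a full pass over x, while each sEM step touches a single datum; this is the cost/variance trade-off the paper addresses.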
We can view bEM as a fixed-point algorithm, and sEM as a Robbins-Monro [30] algorithm for solving the equation s* = F(s*).

Because of its cheap updates, sEM can converge faster than bEM on large data sets in the beginning. However, due to the variance of the estimator $\hat s_t$, sEM has a slower asymptotic convergence rate than bEM on finite data sets. Specifically, let s* = F(s*) be a stationary point. Cappé and Moulines [6] showed that $E\|\hat s_T - s^*\|^2 = O(\rho_T)$ for sEM, which is at best $O(T^{-1})$ since $\sum_t \rho_t = \infty$. In contrast, Dempster et al. [12] showed that bEM converges as $\|\hat s_E - s^*\|^2 \le (1-\lambda)^{2E}\|\hat s_0 - s^*\|^2$, where $1-\lambda \in [0, 1)$ is a constant defined in Sec. 3.3. As long as the data set is finite, the exponential rate of bEM is faster than that of sEM.² Moreover, sEM needs a decreasing sequence of step sizes to converge, whose decay rate is difficult to tune.

3 Variance Reduced Stochastic Expectation Maximization

In this section, we describe a variance reduced stochastic EM algorithm (sEM-vr), and develop the theory for its convergence. sEM-vr enjoys an exponential convergence rate with a constant step size.

3.1 Algorithm Description

We run the algorithm for E epochs and M minibatch iterations per epoch, so that there are T := ME iterations in total. For simplicity we choose M = N and use minibatches of size 1, though our analysis is not limited to this case. Each epoch has the same time complexity as bEM. We index iteration t in epoch e as (e, t). Let $\hat s_{e,t}$ be the estimated sufficient statistics at iteration (e, t). Starting from the initial estimate $\hat s_{0,0}$, sEM-vr performs the following updates in epoch e:

Stochastic EM with Variance Reduction
1. Compute $F(\hat s_{e,0})$, and save $F(\hat s_{e,0})$ as well as $\hat s_{e,0}$.
2. 
For each iteration t = 1, . . . , M, randomly sample a datum i, and update

$\hat s_{e,t+1} = (1-\rho)\hat s_{e,t} + \rho\left[f_i(\hat s_{e,t}) - f_i(\hat s_{e,0}) + F(\hat s_{e,0})\right]$.   (5)

3. Let $\hat s_{e+1,0} = \hat s_{e,M}$.

Let $E_{e,t}$ and $\mathrm{Var}_{e,t}$ be the expectation and variance over the random index i at iteration (e, t). Comparing Eq. (5) with Eq. (4), we observe that the sEM and sEM-vr updates have the same expectation, $E_t[\hat s_{t+1}] = (1-\rho)\hat s_t + \rho F(\hat s_t)$. However, their variances are different: sEM has $\mathrm{Var}_t[\hat s_{t+1}] = \rho_t^2 \mathrm{Var}_t[f_i(\hat s_t)]$, while sEM-vr has $\mathrm{Var}_{e,t}[\hat s_{e,t+1}] = \rho^2 \mathrm{Var}_{e,t}[f_i(\hat s_{e,t}) - f_i(\hat s_{e,0})]$. If the algorithm converges, i.e., the sequence $(\hat s_{e,t})$ converges to a point s*, and $f_i(\cdot)$ is continuous, the variance of sEM-vr converges to zero, while that of sEM remains positive. Therefore, sEM-vr has asymptotically smaller variance than sEM, and we will see that this leads to better asymptotic convergence rates.

² Without affecting the convergence rates, we slightly adjust the convergence theorems in [6, 12] to view them in a unified way; see Appendix A for details.

The time complexity of sEM-vr per epoch is the same as that of bEM and sEM, with a constant factor of up to 3, for computing $f_i(\hat s_{e,t})$, $f_i(\hat s_{e,0})$ and $F(\hat s_{e,0})$. The space complexity also has a constant factor of up to 3, for storing $\hat s_{e,0}$ and $F(\hat s_{e,0})$ along with $\hat s_{e,t}$. In practice, the difference is less than 3 times, because the time and space costs for other aspects of the methods are the same, e.g. data storage.

3.2 Related Works

A possible alternative to sEM is Titterington's online algorithm [33], which replaces the exact M-step with a gradient ascent step to optimize Q(θ; θ̂), where the gradient is multiplied by the inverse Fisher information of p(x, h; θ). 
Titterington's algorithm is locally equivalent to sEM [6]. However, as argued by Cappé and Moulines [6], Titterington's algorithm has several issues, including the Fisher information being expensive to compute in high dimensions, the need for explicit matrix inversion, and that the updated parameters are not guaranteed to be valid. Moreover, leveraging variance reduced stochastic gradient algorithms [20, 22, 11] for Titterington's algorithm is not straightforward, as the Fisher information matrix changes with θ. Zhu et al. [39] proposed a variance reduced stochastic gradient EM algorithm. There are also theoretical analyses of the EM algorithm for high dimensional data [3, 35].

Instead of performing point estimation of parameters, Bayesian inference algorithms, including variational inference (VI) and Markov chain Monte Carlo (MCMC), can also be adopted to infer the posterior distribution of parameters. Variance reduction techniques have also been applied in these settings, including smoothed stochastic variational inference (SSVI) [25] and variance reduced stochastic gradient MCMC (VRSGMCMC) algorithms [13, 8, 7]. However, convergence guarantees for SSVI have not been developed, while VRSGMCMC algorithms are typically much slower than sEM-vr due to the intrinsic randomness of MCMC. For example, the time complexity to converge to an ε-precision in terms of the 2-Wasserstein distance between the true posterior and the MCMC distribution is $O(N + \kappa^{3/2}\sqrt{d}/\epsilon)$, where κ is a condition number and d is the dimensionality of the parameters [7].

3.3 Local Convergence Rate

We analyze the local convergence rate of a sequence $\{\hat s_{e,t}\}$ of sEM-vr iterates to a stationary point s* with s* = F(s*). Let θ* := R(s*) be the natural parameter corresponding to the mean parameter s*.

Theorem 1. 
If

(a) the Hessian $\nabla^2 L(\theta^*)$ is negative definite, i.e., θ* is a strict local maximizer of L(θ);
(b) for all i, $f_i(s)$ is $L_f$-Lipschitz continuous, and F(s) is $\beta_f$-smooth;
(c) for all (e, t), $\|\hat s_{e,t} - s^*\| < \lambda/\beta_f$, where $1-\lambda$ is the maximum eigenvalue of $J^* := \partial F(s^*)/\partial s^*$;

then, for any step size $\rho \le \lambda/(32L_f^2)$, we have

$E\|\hat s_{E,0} - s^*\|^2 \le \left[\exp(-M\lambda\rho/4) + 32L_f^2\rho/\lambda\right]^E \|\hat s_{0,0} - s^*\|^2$.   (6)

In particular, if $\rho = \rho^* := 4\log(M/\kappa^2)/(\lambda M)$, where $\kappa^2 := 128L_f^2/\lambda^2$, then we have

$E\|\hat s_{E,0} - s^*\|^2 \le \left[\left(1 + \log(M/\kappa^2)\right)\kappa^2/M\right]^E \|\hat s_{0,0} - s^*\|^2$.   (7)

Remarks. Assumption (a) follows directly from the original EM paper (Theorem 4) [12]. [12] analyzed the convergence only in an infinitesimal neighbourhood of s*, while Assumption (c) gives an explicit radius of convergence. Assumption (b) is new and is required to control the variance and the radius of convergence. Note also that we analyze the convergence of the mean parameters, while [12] analyzed that of the parameters. However, they are equivalent if R(s) is Lipschitz continuous. In Appendix A.1 we show that the negative definiteness of $\nabla^2 L(\theta^*)$ in Assumption (a) implies that λ > 0 in Assumption (c).

Proof. We first analyze the convergence behavior at a specific epoch e, and omit the epoch index e for concise notation. We further denote $\Delta_t := \hat s_t - s^*$ for any t. By Eq. 
(5),

$E_t\|\Delta_{t+1}\|^2 = E_t\|(1-\rho)\hat s_t + \rho F(\hat s_t) - s^* + \rho[f_i(\hat s_t) - f_i(\hat s_0) - F(\hat s_t) + F(\hat s_0)]\|^2 = \|(1-\rho)\hat s_t + \rho F(\hat s_t) - s^*\|^2 + \rho^2 E_t\|f_i(\hat s_t) - f_i(\hat s_0) - F(\hat s_t) + F(\hat s_0)\|^2$,   (8)

where the second equality is due to $E_t[f_i(\hat s_{e,t}) - f_i(\hat s_{e,0}) + F(\hat s_{e,0})] = F(\hat s_{e,t})$. We have

$\|(1-\rho)\hat s_t + \rho F(\hat s_t) - s^*\|^2 = \|(1-\rho)\Delta_t + \rho(F(\hat s_t) - s^*) + \rho J^*\Delta_t - \rho J^*\Delta_t\|^2 \le \left[\|(1-\rho)\Delta_t + \rho J^*\Delta_t\| + \rho\|F(\hat s_t) - s^* - J^*\Delta_t\|\right]^2 \le \left[(1-\rho\lambda)\|\Delta_t\| + (\rho/2)\beta_f\|\Delta_t\|^2\right]^2 = \left[1 - \rho(\lambda - \beta_f\|\Delta_t\|/2)\right]^2\|\Delta_t\|^2 \le (1-\rho\lambda/2)^2\|\Delta_t\|^2 \le (1-\rho\lambda/2)\|\Delta_t\|^2$,   (9)

where the second inequality uses the triangle inequality, and the third uses $\|(1-\rho)I + \rho J^*\| \le 1 - \rho + \rho(1-\lambda) = 1 - \rho\lambda$ (where ‖·‖ is the ℓ₂ operator norm) together with the smoothness in (b), which implies $\|F(\hat s_t) - s^* - J^*(\hat s_t - s^*)\| \le (\beta_f/2)\|\hat s_t - s^*\|^2$; the last line uses (c).

By (b), F is $L_f$-Lipschitz and, for all i, $f_i - F$ is $2L_f$-Lipschitz continuous. 
Therefore,

$E_t\|f_i(\hat s_t) - f_i(\hat s_0) - F(\hat s_t) + F(\hat s_0)\|^2 \le 4L_f^2\|\hat s_t - \hat s_0\|^2 \le 8L_f^2\left(\|\Delta_t\|^2 + \|\Delta_0\|^2\right)$.   (10)

Combining Eqs. (8), (9) and (10), and utilizing our assumption $\rho \le \lambda/(32L_f^2)$, we have

$E\|\Delta_{t+1}\|^2 \le \left(1 - \rho\lambda/2 + 8\rho^2 L_f^2\right)\|\Delta_t\|^2 + 8\rho^2 L_f^2\|\Delta_0\|^2 \le (1-\rho\lambda/4)\|\Delta_t\|^2 + 8\rho^2 L_f^2\|\Delta_0\|^2$.

We get Eqs. (6) and (7) by analyzing the sequence $a_{t+1} \le (1-\epsilon\rho)a_t + c\rho^2 a_0$, where $a_t = E\|\Delta_t\|^2$, $\epsilon = \lambda/4$ and $c = 8L_f^2$; unrolling over one epoch gives $a_M \le \left[(1-\epsilon\rho)^M + (c/\epsilon)\rho\right]a_0 \le \left[\exp(-M\epsilon\rho) + (c/\epsilon)\rho\right]a_0$, which is exactly the per-epoch factor in Eq. (6). The full analysis is in Appendix B.

Comparison with bEM: As mentioned in Sec. 2.2, bEM has $E\|\hat s_E - s^*\|^2 \le (1-\lambda)^{2E}\|\hat s_0 - s^*\|^2$. The distance decreases exponentially for both bEM and sEM-vr, but at different speeds. If M is large, sEM-vr (Eq. 7) converges much faster than bEM, because $\left(1 + \log(M/\kappa^2)\right)\kappa^2/M \ll (1-\lambda)^2$, thanks to its cheap stochastic updates.

Comparison with sEM: As mentioned in Sec. 2.2, sEM has $E\|\hat s_T - s^*\|^2 = O(T^{-1})$, which is not exponential, and is asymptotically slower than sEM-vr. The key difference is that we can bound the variance term of sEM-vr by $\|\hat s_t - \hat s_0\|^2$ in Eq. (10), so the variance goes to zero as $(\hat s_{e,t})$ converges. The advantage of sEM-vr over sEM is especially significant when E is large. Moreover, sEM requires a decreasing sequence of step sizes to converge [6], which is more difficult to tune than the constant step size of sEM-vr.

3.4 Global Convergence

Theorem 1 only considers the case near a local maximum of the log marginal likelihood. 
We now show that, under stronger assumptions, there exists a constant step size such that sEM-vr converges globally to a stationary point s* = F(s*), one with ∇L(s*) = 0 [12].

Theorem 2. Suppose

(a) the natural parameter function η(θ) is $L_\eta$-Lipschitz, and $f_i(s)$ is $L_f$-Lipschitz for all i;
(b) for any x and h, log p(x, h; θ) is γ-strongly-concave w.r.t. θ.

Then for any constant step size $\rho < \gamma/(M(M-1)L_\eta L_f)$, sEM-vr converges to a stationary point, starting from any valid sufficient statistics vector $\hat s_{0,0}$.

A sufficient condition for (b) is that the exponential family is canonical, i.e., η(θ) = θ, and that we seek the MAP estimate instead of the MLE, where the log prior log p(θ) is γ-strongly-concave. We leave the proof to Appendix C. The idea is to first show that sEM-vr is a generalized EM (GEM) algorithm [36], which improves E[Q(θ; θ̂)] after each epoch, and then apply Wu's convergence theorem for GEM [36].

Data set      D      V      |I|
NIPS [1]      1.5k   12k    1.93m
NYTimes [1]   0.3m   102k   99m
Wiki [38]     3.6m   8k     524m
PubMed [1]    8.1m   141k   731m

Table 1: Statistics of datasets for pLSA. k = thousands, m = millions.

Figure 1: Toy Gaussian mixture. Left: $\log_{10} E\|\hat\mu_t - \mu^*\|^2$; Right: $\log_{10}\mathrm{Var}_t[\hat\mu_t]/\rho_t^2$. X-axis: number of epochs.

4 Applications and Experiments

We demonstrate the application of sEM-vr on a toy Gaussian mixture model and probabilistic latent semantic analysis.

4.1 Toy Gaussian Mixture

We fit a mixture of two Gaussians, p(x|µ) = 0.2 N(µ, 1) + 0.8 N(−µ, 1), with a single 
unknown parameter µ. Let $X = \{x_i\}_{i=1}^N$ be the data set, and $h_i \in \{1, 2\}$ be the cluster assignment of $x_i$. We write $h_{ik} := I(h_i = k)$ as a shortcut, where I(·) is the indicator function. The joint likelihood is $p(X, H|\mu) \propto \exp\{\sum_i \sum_k h_{ik}\log N(x_i; \mu_k, 1)\} \propto \exp\{\sum_i \eta(\mu)^\top\phi(x_i, h_i)\}$, where, with component means $\mu_1 = \mu$ and $\mu_2 = -\mu$, the natural parameter is $\eta(\mu) = (\mu, -\mu, -\mu^2/2, -\mu^2/2)$ and the sufficient statistics are $\phi(x_i, h_i) = (x_i h_{i1}, x_i h_{i2}, h_{i1}, h_{i2})$. Let $\gamma_{ik}(\mu) = p(h_i = k|x_i, \mu) \propto \pi_k N(x_i; \mu_k, 1)$ for k ∈ {1, 2} be the posterior probabilities, where π = (0.2, 0.8) are the mixture weights. The expected sufficient statistics are $f_i(\mu) = E_{p(h_i|x_i,\mu)}\phi(x_i, h_i) = (x_i\gamma_{i1}(\mu), x_i\gamma_{i2}(\mu), \gamma_{i1}(\mu), \gamma_{i2}(\mu))$, and $F(\mu) = \frac{1}{N}\sum_i f_i(\mu)$. The mapping from sufficient statistics to parameters is $R(s) = (s_1 - s_2)/(s_3 + s_4)$. The bEM, sEM, and sEM-vr updates are then defined by Eq. (3), Eq. (4), and Eq. (5), respectively.

We construct a dataset of N = 10,000 samples drawn from the model with µ = 0.5, and run bEM until convergence (to double precision) to obtain the MLE µ*. We then measure the convergence of $E\|\hat\mu_t - \mu^*\|^2$, as well as the variance term $\mathrm{Var}_t[\hat\mu_t]/\rho_t^2$, for bEM, sEM, and sEM-vr with respect to the number of epochs. $\mathrm{Var}_t[\hat\mu_t]$ is always quadratic in the step size $\rho_t$, so we divide it by $\rho_t^2$ to cancel the effect of the step size and study the intrinsic variance alone. We tune the step size manually, and set $\rho_t = 3/(t + 10)$ for sEM and ρ = 0.003 for sEM-vr.

The results are shown in Fig. 1. sEM converges faster than bEM in the first 8 epochs, after which it is outperformed by bEM, because sEM is asymptotically slower, as mentioned in Sec. 2.2. The convergence curve of sEM-vr exhibits a staircase pattern. At the beginning of each epoch it converges very fast, because $\|\hat s_{e,t} - \hat s_{e,0}\|$ is small, so the variance is small. The variance then becomes larger and the convergence slows down. 
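This toy run can be reproduced in miniature. Below is a simplified Python sketch of sEM-vr (Eq. 5) on this two-component mixture; the M-step R is derived directly from the tied-mean model (with normalized statistics, $s_3 + s_4 = 1$), and the script is an illustration, not the paper's implementation:

```python
import math
import random

def normpdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def f_i(x, mu):
    """Expected sufficient statistics (x*g1, x*g2, g1, g2) of one datum for
    the mixture 0.2*N(mu,1) + 0.8*N(-mu,1), with posterior weights g_k."""
    a = 0.2 * normpdf(x, mu)
    b = 0.8 * normpdf(x, -mu)
    g1 = a / (a + b)
    return (x * g1, x * (1 - g1), g1, 1 - g1)

def F(data, mu):
    """Full-batch expected sufficient statistics: the mean of f_i over data."""
    n = len(data)
    return tuple(sum(col) / n for col in zip(*(f_i(x, mu) for x in data)))

def R(s):
    """M-step for the tied means (mu, -mu): maximizes the expected
    complete-data log likelihood given normalized statistics s."""
    return (s[0] - s[1]) / (s[2] + s[3])

def sem_vr(data, mu0, epochs, rho, seed=0):
    """sEM-vr (Eq. 5) with per-epoch control variates f_i(s_{e,0}), F(s_{e,0})."""
    rng = random.Random(seed)
    n = len(data)
    s = F(data, mu0)
    for _ in range(epochs):
        s0 = s                        # snapshot at the start of the epoch
        F0 = F(data, R(s0))           # full-batch control variate
        for _ in range(n):            # M = N minibatch iterations of size 1
            i = rng.randrange(n)
            ft = f_i(data[i], R(s))
            f0 = f_i(data[i], R(s0))
            s = tuple((1 - rho) * sc + rho * (a - b + c)
                      for sc, a, b, c in zip(s, ft, f0, F0))
    return R(s)
```

Run on data drawn from the model with µ = 0.5, a constant step size such as ρ = 0.05 drives the estimate close to the bEM fixed point within a handful of epochs.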
Then we start a new epoch and compute a new $F(\hat s_{e,0})$, so that the convergence is fast again. The variance of sEM, on the other hand, remains constant.

4.2 Probabilistic Latent Semantic Analysis

4.2.1 Model and Algorithm

Probabilistic latent semantic analysis (pLSA) [18] represents text documents as mixtures of topics. pLSA takes a list I of tokens, where each token i is represented by a pair of document and word IDs $(d_i, v_i)$, which indicates the presence of word $v_i$ in document $d_i$. Denoting [n] = {1, . . . , n}, we have $d_i \in [D]$ and $v_i \in [V]$. pLSA assigns a latent topic $z_i \in [K]$ to each token, and defines the joint likelihood as $p(I, Z|\theta, \phi) = \prod_{i\in I}\mathrm{Cat}(z_i; \theta_{d_i})\mathrm{Cat}(v_i; \phi_{z_i})$, with the parameters $\theta = \{\theta_d\}_{d=1}^D$ and $\phi = \{\phi_k\}_{k=1}^K$. We place priors $p(\theta_d) = \mathrm{Dir}(\theta_d; K, \alpha')$ and $p(\phi_k) = \mathrm{Dir}(\phi_k; V, \beta')$, where Dir(K, α) is a K-dimensional symmetric Dirichlet distribution with concentration parameter α, and find a MAP estimate $\mathrm{argmax}_{\theta,\phi}\log\sum_Z p(W, Z|\theta, \phi) + \log p(\theta) + \log p(\phi)$. Only the updates are presented here; the derivation is in Appendix D. 
Let $\gamma_{ik}(\theta, \phi) := p(z_i = k|v_i, \theta, \phi) \propto \theta_{d_i,k}\phi_{k,v_i}$ be the posterior topic assignment of token i. In the E-step, bEM updates $\gamma_{dk}(\theta, \phi) = \sum_{i\in I_d}\gamma_{ik}(\theta, \phi)$ and $\gamma_{kv}(\theta, \phi) = \sum_{i\in I_v}\gamma_{ik}(\theta, \phi)$, where $I_d = \{(d_i, v_i)\,|\,d_i = d\}$ and $I_v = \{(d_i, v_i)\,|\,v_i = v\}$. The M-step is $\theta_{dk} = (\gamma_{dk} + \alpha)/(\sum_k \gamma_{dk} + K\alpha)$ and $\phi_{kv} = (\gamma_{kv} + \beta)/(\sum_v \gamma_{kv} + V\beta)$, where $\alpha = \alpha' - 1$ and $\beta = \beta' - 1$. We distinguish $(\gamma_{ik}, \gamma_{dk}, \gamma_{kv})$ and $(I, I_d, I_v)$ by their indices for simplicity.

sEM approximates the full-batch expected sufficient statistics $\gamma_{dk}$ and $\gamma_{kv}$ with exponential moving averages $\hat s_{t,d,k}$ and $\hat s_{t,k,v}$ at iteration t, and updates $\hat s_{t+1,d,k} = (1-\rho_t)\hat s_{t,d,k} + \rho_t\frac{|I|}{|\hat I|}\sum_{i\in\hat I_d}\gamma_{ik}(\hat\theta_t, \hat\phi_t)$ and $\hat s_{t+1,k,v} = (1-\rho_t)\hat s_{t,k,v} + \rho_t\frac{|I|}{|\hat I|}\sum_{i\in\hat I_v}\gamma_{ik}(\hat\theta_t, \hat\phi_t)$, where we sample a minibatch $\hat I \subset I$ of tokens per iteration, and $\hat I_d, \hat I_v$ are defined in the same way as $I_d, I_v$. $\hat\theta_t$ and $\hat\phi_t$ are computed in the M-step from $\hat s_{t,d,k}$ and $\hat s_{t,k,v}$. 
This sEM algorithm is known as SCVB0 [16].

sEM-vr updates $\hat s_{e,t+1,d,k} = (1-\rho)\hat s_{e,t,d,k} + \rho\frac{|I|}{|\hat I|}\sum_{i\in\hat I_d}(\gamma_{ik}(\hat\theta_{e,t}, \hat\phi_{e,t}) - \gamma_{ik}(\hat\theta_{e,0}, \hat\phi_{e,0})) + \rho\gamma_{dk}(\hat\theta_{e,0}, \hat\phi_{e,0})$ and $\hat s_{e,t+1,k,v} = (1-\rho)\hat s_{e,t,k,v} + \rho\frac{|I|}{|\hat I|}\sum_{i\in\hat I_v}(\gamma_{ik}(\hat\theta_{e,t}, \hat\phi_{e,t}) - \gamma_{ik}(\hat\theta_{e,0}, \hat\phi_{e,0})) + \rho\gamma_{kv}(\hat\theta_{e,0}, \hat\phi_{e,0})$, where $\gamma_{dk}(\hat\theta_{e,0}, \hat\phi_{e,0})$ and $\gamma_{kv}(\hat\theta_{e,0}, \hat\phi_{e,0})$ are computed by bEM once per epoch. Pseudocode for sEM and sEM-vr is given in Appendix D.

If θ is integrated out instead of maximized, we recover a MAP estimate [14] of latent Dirichlet allocation (LDA) [4]. Many existing algorithms for LDA actually optimize the pLSA objective as an approximation of the LDA objective, including CVB0 [2, 31, 19], SCVB0 [16], BP-LDA [10], ESCA [37], and WarpLDA [9]. This approximation works well in practice when the number of topics is small [2]. We provide more discussion in Appendix D.1.

4.2.2 Experimental Settings

We compare sEM-vr with bEM and sEM (SCVB0), the state-of-the-art algorithm for pLSA, on the four datasets listed in Table 1. We also compare with two gradient-based algorithms, stochastic mirror descent (SMD) [10] and reparameterized stochastic gradient descent (RSGD), as well as their variants with SVRG-style [20] variance reduction, denoted SMD-vr and RSGD-vr, although their convergence properties are unknown. Both SMD and RSGD replace the M-step with a stochastic gradient step. SMD updates $\theta_{dk} \propto \theta_{dk}\exp\{\rho\nabla_{\theta_{dk}}Q\}$ and $\phi_{kv} \propto \phi_{kv}\exp\{\rho\nabla_{\phi_{kv}}Q\}$, where Q is defined in Eq. (1). 
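The multiplicative SMD step just described is an exponentiated-gradient update on the probability simplex; a minimal sketch, where the gradient values are hypothetical stand-ins for $\nabla Q$:

```python
import math

def smd_row_update(theta_row, grad_row, rho):
    """One SMD (exponentiated-gradient) step on a probability vector:
    theta_k <- theta_k * exp(rho * grad_k), then renormalize to the simplex.
    grad_row stands in for the gradient of Q w.r.t. this row."""
    unnorm = [t * math.exp(rho * g) for t, g in zip(theta_row, grad_row)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

theta = [0.25, 0.25, 0.5]
grad = [1.0, -1.0, 0.0]      # hypothetical gradient values
theta = smd_row_update(theta, grad, rho=0.1)
print(abs(sum(theta) - 1.0) < 1e-12)   # the update stays on the simplex
```

Because the update is multiplicative and renormalized, the parameters remain valid probabilities without any projection step, unlike a plain additive gradient step.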
RSGD adopts the softmax reparameterization $\theta_{dk} = \exp(\lambda_{dk})/\sum_{k'}\exp(\lambda_{dk'})$ and $\phi_{kv} = \exp(\tau_{kv})/\sum_{v'}\exp(\tau_{kv'})$, and directly optimizes Q w.r.t. λ and τ by stochastic gradient descent. Derivations of SMD and RSGD are in Appendix D.6. All the algorithms are implemented in C++, and are highly optimized and parallelized. The testing machine has two 12-core Xeon E5-2692v2 CPUs and 64GB of main memory.

We assess the convergence of the algorithms by the training objective $\log p(W|\theta, \phi) + \log p(\theta|\alpha') + \log p(\phi|\beta')$, i.e., the logarithm of the unnormalized posterior $p(\theta, \phi|W, \alpha', \beta')$. For each dataset and each number of topics K ∈ {50, 100}, we first select the hyperparameters by a grid search over Kα ∈ {0.1, 1, 10, 100} and β ∈ {0.01, 0.1, 1}.³ Then, we do another grid search to choose the step size. For sEM-vr, we choose ρ ∈ {0.01, 0.02, 0.05, 0.1, 0.2}; for all other stochastic algorithms, we set $\rho_t = a/(t + t_0)^\kappa$, and choose a ∈ {10⁻⁷, . . . , 10⁰}, t₀ ∈ {10, 100, 1000} and κ ∈ {0.5, 0.75, 1}.⁴ Finally, we repeat 5 runs with different random seeds for each algorithm with its best step size. E is 20 for NIPS and NYTimes, and 5 for Wiki and PubMed. M is 50 for NIPS and 500 for all the other datasets.

4.2.3 Results for pLSA

We plot the training objective against running time in the first and second rows of Fig. 2. We find that the gradient-based algorithms and bEM are not competitive with sEM and sEM-vr, so we only report their results on NIPS, to make the distinction between sEM and sEM-vr clearer. 
Full results and more explanations of the slow convergence of gradient-based algorithms are available in Appendix D.6. Due to the reduced variance, sEM-vr consistently converges faster to a better training objective than sEM and bEM on all the datasets, while the constant step size of sEM-vr is easier to tune than the decreasing sequence of step sizes for sEM.

3We find that all the algorithms have the same best hyperparameter configuration.
4We have tried constant step sizes for SMD-vr and RSGD-vr but found them worse than decreasing step sizes.

Figure 2: pLSA and LDA convergence results. Columns: NIPS, NYTimes, Wiki, PubMed. X-axis is running time in seconds. First and second rows: pLSA with K = 50 and K = 100; y-axis is the training objective. Third row: LDA with K = 10; y-axis is the testing perplexity.

4.3 Results for LDA

As discussed in Sec. 4.2.1, algorithms for pLSA also work well as approximate training algorithms for LDA, if the number of topics is small. Therefore, we also evaluate our sEM-vr algorithm for LDA, with a small number of K = 10 topics. The training algorithm is exactly the same, but the evaluation metric is different. We hold out a small testing set and report the testing perplexity, computed by the left-to-right algorithm [34] on the testing set. We compare with a state-of-the-art algorithm, Gibbs online expectation maximization (GOEM) [14], which outperforms a wide range of algorithms including SVI [17], hybrid variational-Gibbs [27], and SGRLD [28]. We also compare with stochastic variational inference (SVI) [17] and its variance reduced variant SSVI [25].
The third row of Fig. 2 shows the results. We observe that sEM-vr converges the fastest on all the datasets except NIPS, where sEM converges faster due to its cheaper iterations. sEM-vr always gets better results than sEM in the end.
GOEM converges slower due to its high Monte-Carlo variance. SVI and SSVI converge slower due to their inexact mean field assumption and expensive iterations, including an inner loop for inferring the local latent variables and frequent evaluation of the expensive digamma function. For a larger number of topics, such as 100, we find that GOEM performs the best, since it does not approximate LDA as pLSA and does not make the mean field assumptions of SVI and SSVI. Extending our algorithm to variational EM and Monte-Carlo EM, when the E-step is not tractable, is an interesting future direction.

5 Conclusions and Discussions

We propose a variance reduced stochastic EM (sEM-vr) algorithm. sEM-vr achieves a $(1 + \log(M/\kappa_2))^{-E}$ local convergence rate, which is faster than both the $(1 - \lambda)^{-2E}$ rate of batch EM and the $O(T^{-1})$ rate of plain stochastic EM (sEM). Unlike sEM, which requires a decreasing sequence of step sizes to converge, sEM-vr only requires a constant step size to achieve this local convergence rate, as well as global convergence under stronger assumptions.
We compare sEM-vr against bEM, sEM and other gradient-based and Bayesian algorithms on GMM and pLSA tasks, and find that sEM-vr converges significantly faster than these alternatives.
An interesting future direction is leveraging recent progress on variance reduced stochastic gradient descent for non-convex optimization [23] to relax our strong log-concavity assumption, and extending sEM-vr to stochastic control variates, which work better on very large data sets. Extending our work to variational EM and Monte-Carlo EM is also interesting.

Acknowledgments

We thank Chris Maddison, Adam Foster, and Jin Xu for proofreading. J.C. and J.Z. were supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC projects (Nos. 61620106010, 61621136008, 61332007), the MIIT Grant of Int. Man. Comp. Stan (No. 2016ZXFB00001), Tsinghua Tiangong Institute for Intelligent Computing, the NVIDIA NVAIL Program and a Project from Siemens. YWT was supported by funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071, and from Tencent AI Lab through the Oxford-Tencent Collaboration on Large Scale Machine Learning.

References

[1] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.

[2] Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 27–34. AUAI Press, 2009.

[3] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.

[4] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[5] Olivier Cappé.
Online EM algorithm for hidden Markov models. Journal of Computational and Graphical Statistics, 20(3):728–749, 2011.

[6] Olivier Cappé and Eric Moulines. On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.

[7] Niladri S Chatterji, Nicolas Flammarion, Yi-An Ma, Peter L Bartlett, and Michael I Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. arXiv preprint arXiv:1802.05431, 2018.

[8] Changyou Chen, Wenlin Wang, Yizhe Zhang, Qinliang Su, and Lawrence Carin. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. arXiv preprint arXiv:1709.01180, 2017.

[9] Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. WarpLDA: A cache efficient O(1) algorithm for latent Dirichlet allocation. Proceedings of the VLDB Endowment, 9(10):744–755, 2016.

[10] Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Advances in Neural Information Processing Systems, pages 1765–1773, 2015.

[11] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[12] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.

[13] Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barnabas Poczos, Alexander J Smola, and Eric P Xing. Variance reduction in stochastic gradient Langevin dynamics.
In Advances in Neural Information Processing Systems, pages 1154–1162, 2016.

[14] Christophe Dupuy and Francis Bach. Online but accurate inference for latent variable models with local Gibbs sampling. The Journal of Machine Learning Research, 18(1):4581–4625, 2017.

[15] Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[16] James Foulds, Levi Boyles, Christopher DuBois, Padhraic Smyth, and Max Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 446–454. ACM, 2013.

[17] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[18] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289–296. Morgan Kaufmann Publishers Inc., 1999.

[19] Katsuhiko Ishiguro, Issei Sato, and Naonori Ueda. Averaged collapsed variational Bayes inference. Journal of Machine Learning Research, 18(1):1–29, 2017.

[20] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[21] Harold Kushner and G George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.

[22] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

[23] Lihua Lei and Michael Jordan.
Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.

[24] Percy Liang and Dan Klein. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611–619. Association for Computational Linguistics, 2009.

[25] Stephan Mandt and David Blei. Smoothed gradients for stochastic variational inference. In Advances in Neural Information Processing Systems, pages 2438–2446, 2014.

[26] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.

[27] David Mimno, Matt Hoffman, and David Blei. Sparse stochastic inference for latent Dirichlet allocation. arXiv preprint arXiv:1206.6425, 2012.

[28] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.

[29] Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[30] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[31] Issei Sato and Hiroshi Nakagawa. Rethinking collapsed variational Bayes inference for LDA. In ICML, 2012.

[32] Charles Spearman and L. W. Jones. Human Ability. Macmillan, 1950.

[33] D Michael Titterington. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society. Series B (Methodological), pages 257–267, 1984.

[34] Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1105–1112.
ACM, 2009.

[35] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint arXiv:1412.8729, 2014.

[36] CF Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103, 1983.

[37] Manzil Zaheer, Michael Wick, Jean-Baptiste Tristan, Alex Smola, and Guy Steele. Exponential stochastic cellular automata for massively parallel inference. In Artificial Intelligence and Statistics, pages 966–975, 2016.

[38] Aonan Zhang, Jun Zhu, and Bo Zhang. Sparse online topic models. In Proceedings of the 22nd International Conference on World Wide Web, pages 1489–1500. ACM, 2013.

[39] Rongda Zhu, Lingxiao Wang, Chengxiang Zhai, and Quanquan Gu. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In International Conference on Machine Learning, pages 4180–4188, 2017.