{"title": "On the Global Convergence of (Fast) Incremental Expectation Maximization Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 2837, "page_last": 2847, "abstract": "The EM algorithm is one of the most popular algorithms for inference in latent data models. The original formulation of the EM algorithm does not scale to large data sets, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton [1998] proposed an incremental version of EM (iEM) in which, at each iteration, the conditional expectation of the latent data (E-step) is updated only for a mini-batch of observations. Another approach was proposed by Cappé and Moulines [2009], in which the E-step is replaced by a stochastic approximation step closely related to stochastic gradient descent. In this paper, we analyze incremental and stochastic versions of the EM algorithm, as well as the variance-reduced version of [Chen et al., 2018], in a common unifying framework. We also introduce a new incremental version, inspired by the SAGA algorithm of Defazio et al. [2014]. We establish non-asymptotic bounds for global convergence. Numerical applications are presented to illustrate our findings.", "full_text": "On the Global Convergence of (Fast) Incremental Expectation Maximization Methods

Belhal Karimi (CMAP, École Polytechnique, Palaiseau, France) belhal.karimi@polytechnique.edu
Hoi-To Wai (The Chinese University of Hong Kong, Shatin, Hong Kong) htwai@se.cuhk.edu.hk
Eric Moulines (CMAP, École Polytechnique, Palaiseau, France) eric.moulines@polytechnique.edu
Marc Lavielle (INRIA Saclay, Palaiseau, France) marc.lavielle@inria.fr

Abstract

The EM algorithm is one of the most popular algorithms for inference in latent data models.
The original formulation of the EM algorithm does not scale to large data sets, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton [1998] proposed an incremental version of EM (iEM) in which, at each iteration, the conditional expectation of the latent data (E-step) is updated only for a mini-batch of observations. Another approach was proposed by Cappé and Moulines [2009], in which the E-step is replaced by a stochastic approximation step closely related to stochastic gradient descent. In this paper, we analyze incremental and stochastic versions of the EM algorithm, as well as the variance-reduced version of [Chen et al., 2018], in a common unifying framework. We also introduce a new incremental version, inspired by the SAGA algorithm of Defazio et al. [2014]. We establish non-asymptotic bounds for global convergence. Numerical applications are presented to illustrate our findings.

1 Introduction

Many problems in machine learning pertain to tackling an empirical risk minimization problem of the form

min_{θ∈Θ} ℒ(θ) := R(θ) + L(θ) with L(θ) = (1/n) Σ_{i=1}^n L_i(θ), L_i(θ) := −log g(y_i; θ),   (1)

where {y_i}_{i=1}^n are the observations, Θ is a convex subset of ℝ^d for the parameters, R : Θ → ℝ is a smooth convex regularization function, and, for each θ ∈ Θ, g(y; θ) is the (incomplete) likelihood of each individual observation. The objective function ℒ(θ) is possibly non-convex and is assumed to be lower bounded, ℒ(θ) > −∞ for all θ ∈ Θ. In the latent variable model, g(y_i; θ) is the marginal of the complete-data likelihood f(z_i, y_i; θ), i.e., g(y_i; θ) = ∫_Z f(z_i, y_i; θ) μ(dz_i), where {z_i}_{i=1}^n are the (unobserved) latent variables.
We consider the setting where the complete-data likelihood belongs to the curved exponential family, i.e.,

f(z_i, y_i; θ) = h(z_i, y_i) exp( ⟨S(z_i, y_i) | φ(θ)⟩ − ψ(θ) ),   (2)

where ψ(θ) and h(z_i, y_i) are scalar functions, φ(θ) ∈ ℝ^k is a vector function, and S(z_i, y_i) ∈ ℝ^k is the complete-data sufficient statistic. Latent variable models are widely used in machine learning and statistics; examples include mixture models for density estimation, document clustering, and topic modelling; see [McLachlan and Krishnan, 2007] and the references therein.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The basic "batch" EM (bEM) method iteratively computes a sequence of estimates {θ_k, k ∈ ℕ} starting from an initial parameter θ_0. Each iteration of bEM is composed of two steps. In the E-step, a surrogate function is computed as θ ↦ Q(θ, θ_{k−1}) = Σ_{i=1}^n Q_i(θ, θ_{k−1}), where Q_i(θ, θ') := −∫_Z log f(z_i, y_i; θ) p(z_i | y_i; θ') μ(dz_i) and p(z_i | y_i; θ) := f(z_i, y_i; θ)/g(y_i; θ) is the conditional probability density of the latent variable z_i given the observation y_i. When f(z_i, y_i; θ) follows the curved exponential family model, the E-step amounts to computing the conditional expectation of the complete-data sufficient statistics,

s̄(θ) = (1/n) Σ_{i=1}^n s̄_i(θ) where s̄_i(θ) = ∫_Z S(z_i, y_i) p(z_i | y_i; θ) μ(dz_i).   (3)

In the M-step, the surrogate function is minimized, producing a new fit of the parameter, θ_k = arg min_{θ∈Θ} Q(θ, θ_{k−1}).
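To make the two steps concrete, here is a minimal sketch of bEM for the unit-variance GMM of Section 2.1, ignoring the penalization R(θ); the function names and the synthetic data are ours:

```python
import numpy as np

def e_step(y, omega, mu):
    """E-step (3): conditional expectation of the complete-data sufficient
    statistics for a unit-variance GMM, averaged over the n observations."""
    # responsibilities p(z_i = m | y_i; theta); the common 1/sqrt(2*pi)
    # factor cancels in the normalization
    dens = omega * np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    s1 = resp.mean(axis=0)                  # E[1{z_i = m}]
    s2 = (resp * y[:, None]).mean(axis=0)   # E[y_i 1{z_i = m}]
    return s1, s2

def m_step(s1, s2):
    """M-step: closed-form maximizer of the surrogate (weights, means)."""
    return s1, s2 / s1

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-0.5, 1.0, 500), rng.normal(0.5, 1.0, 500)])
omega, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])
for _ in range(100):
    omega, mu = m_step(*e_step(y, omega, mu))
```

With the full-batch statistic and a unit step size, this is exactly the bEM recursion; the stochastic variants discussed next replace the full pass over the data by cheaper proxies.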
The EM method has several appealing features – it is monotone (the likelihood does not decrease across iterations), invariant with respect to the parameterization, numerically stable when the optimization set is well defined, etc. The EM method has been the subject of considerable interest since its formalization in [Dempster et al., 1977].
With the sheer size of data sets today, the bEM method is not applicable, as the E-step (3) involves a full pass over the dataset of n observations. Several approaches based on stochastic optimization have been proposed to address this problem. Neal and Hinton [1998] proposed (but did not analyze) an incremental version of EM, referred to as the iEM method. Cappé and Moulines [2009] developed the online EM (sEM) method, which uses a stochastic approximation procedure to track the sufficient statistics defined in (3). Recently, Chen et al. [2018] proposed a variance-reduced sEM (sEM-VR) method, which is inspired by the SVRG algorithm popular in stochastic convex optimization [Johnson and Zhang, 2013]. The applications of the above stochastic EM methods are numerous, especially for the iEM and sEM methods; e.g., [Thiesson et al., 2001] for inference with missing data, [Ng and McLachlan, 2003] for mixture models and unsupervised clustering, [Hinton et al., 2006] for inference of deep belief networks, [Hofmann, 1999] for probabilistic latent semantic analysis, [Wainwright et al., 2008, Blei et al., 2017] for variational inference of graphical models, and [Ablin et al., 2018] for Independent Component Analysis.
This paper focuses on the theoretical side of stochastic EM methods, establishing novel non-asymptotic and global convergence rates for them. Our contributions are as follows.

• We offer two complementary views of the global convergence of EM methods – one focuses on the parameter space, and one on the sufficient statistics space.
On the one hand, the EM method can be studied as a majorization-minimization (MM) method in the parameter space. On the other hand, the EM method can be studied as a scaled-gradient method in the sufficient statistics space.

• Based on these two views, we derive non-asymptotic convergence rates for stochastic EM methods. First, we show that the iEM method [Neal and Hinton, 1998] is a special instance of the MISO framework [Mairal, 2015] and takes O(n/ε) iterations to find an ε-stationary point of the ML estimation problem. Second, the sEM-VR method [Chen et al., 2018] is an instance of a variance-reduced stochastic scaled-gradient method, which takes O(n^{2/3}/ε) iterations to find an ε-stationary point.

• Lastly, we develop a Fast Incremental EM (fiEM) method based on the SAGA algorithm [Defazio et al., 2014, Reddi et al., 2016b] for stochastic optimization. We show that the new method is again a scaled-gradient method with the same iteration complexity as sEM-VR. This new method offers a trade-off between storage cost and computation complexity.

Importantly, our results capitalize on the efficiency of stochastic EM methods applied to large datasets, and we support the above findings with numerical experiments.

Prior Work  Since the empirical risk minimization problem (1) is typically non-convex, most prior work studying the convergence of EM methods considered either asymptotic and/or local behaviors. For the classical theory, the global convergence to a stationary point (either a local minimum or a saddle point) of the bEM method was established by Wu et al. [1983] (by making the arguments developed in Dempster et al. [1977] rigorous). The global convergence is a direct consequence of the EM method being monotone.
It is also known that, in the neighborhood of a stationary point and under regularity conditions, the local rate of convergence of bEM is linear and is given by the amount of missing information [McLachlan and Krishnan, 2007, Chapters 3 and 4].
The convergence of the iEM method was first tackled by Gunawardana and Byrne [2005], exploiting the interpretation of the method as an alternating minimization procedure under the information-geometry framework developed in [Csiszár and Tusnády, 1984]. Although the EM algorithm is presented as an alternation between the E-step and M-step, it is also possible to take a variational perspective on EM and view both steps as maximization steps. Nevertheless, Gunawardana and Byrne [2005] assume that the latent variables take only a finite number of values and that the order in which the observations are processed remains the same from one pass to the next.
More recently, the local but non-asymptotic convergence of EM methods has been studied in several works. These results typically require the initialization to be within a neighborhood of an isolated stationary point and the (negated) log-likelihood function to be strongly convex locally. Such conditions are either difficult to verify in general or have been derived only for specific models; see for example [Wang et al., 2015, Xu et al., 2016, Balakrishnan et al., 2017] and the references therein. The local convergence of the sEM-VR method has been studied in [Chen et al., 2018, Theorem 1], but under a pathwise global stability condition. The authors' work [Karimi et al., 2019] provided the first global non-asymptotic analysis of the online (stochastic) EM method [Cappé and Moulines, 2009]. In comparison, the present work analyzes the variance-reduced variants of the EM method. Lastly, it is worthwhile to mention that Zhu et al.
[2017] analyzed a variance-reduced gradient EM method similar to [Balakrishnan et al., 2017].

2 Stochastic Optimization Techniques for EM methods

Let k ≥ 0 be the iteration number. The k-th iteration of a generic stochastic EM method is composed of two sub-steps – firstly,

sE-step: ŝ^(k+1) = ŝ^(k) − γ_{k+1} (ŝ^(k) − S^(k+1)),   (4)

which is a stochastic version of the E-step in (3). Note that {γ_k}_{k=1}^∞ ∈ [0, 1] is a sequence of step sizes and S^(k+1) is a proxy for s̄(θ̂^(k)), with s̄ defined in (3). Secondly, the M-step is given by

M-step: θ̂^(k+1) = θ̄(ŝ^(k+1)) := arg min_{θ∈Θ} { R(θ) + ψ(θ) − ⟨ŝ^(k+1) | φ(θ)⟩ },   (5)

which depends on the sufficient statistics from the sE-step. The stochastic EM methods differ in the way that S^(k+1) is computed. Existing methods employ stochastic approximation or variance reduction without fully computing s̄(θ̂^(k)). To simplify notation, we define

s̄_i^(k) := s̄_i(θ̂^(k)) = ∫_Z S(z_i, y_i) p(z_i | y_i; θ̂^(k)) μ(dz_i) and s̄^(ℓ) := s̄(θ̂^(ℓ)) = (1/n) Σ_{i=1}^n s̄_i^(ℓ).   (6)

If S^(k+1) = s̄^(k) and γ_{k+1} = 1, then (4) reduces to the E-step of the classical bEM method. To formally describe the stochastic EM methods, we let i_k ∈ ⟦1, n⟧ be a random index drawn at iteration k, and τ_i^k = max{k' : i_{k'} = i, k' < k} be the index of the iteration at which i ∈ ⟦1, n⟧ was last drawn prior to iteration k.
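In code, the sE-step (4) is a one-liner, and the proxy choices that follow differ only in the bookkeeping they maintain; a sketch with container and function names of our choosing:

```python
import numpy as np

def se_step(s_hat, S_proxy, gamma):
    """sE-step (4): move the running statistic towards the proxy."""
    return s_hat - gamma * (s_hat - S_proxy)

class IEMProxy:
    """iEM-style proxy (7): store one statistic per sample and correct the
    running average when sample i is refreshed."""
    def __init__(self, per_sample):          # per_sample: (n, dim) array
        self.mem = per_sample.copy()
        self.avg = per_sample.mean(axis=0)

    def update(self, i, s_i):
        self.avg = self.avg + (s_i - self.mem[i]) / len(self.mem)
        self.mem[i] = s_i
        return self.avg

def sem_proxy(s_i):
    """sEM proxy (8): the freshly drawn per-sample statistic itself."""
    return s_i

def sem_vr_proxy(snapshot_avg, snapshot_i, s_i):
    """sEM-VR proxy (9): SVRG-style control variate around a full-batch
    snapshot taken at the start of the epoch."""
    return snapshot_avg + (s_i - snapshot_i)
```

The iEM container costs O(n) memory but makes each proxy an exact running average; the sEM-VR proxy trades that memory for one full-batch snapshot per epoch.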
The proxy S^(k+1) in (4) is drawn as:

(iEM [Neal and Hinton, 1998])  S^(k+1) = S^(k) + (1/n)(s̄_{i_k}^(k) − s̄_{i_k}^(τ_{i_k}^k))   (7)
(sEM [Cappé and Moulines, 2009])  S^(k+1) = s̄_{i_k}^(k)   (8)
(sEM-VR [Chen et al., 2018])  S^(k+1) = s̄^(ℓ(k)) + (s̄_{i_k}^(k) − s̄_{i_k}^(ℓ(k)))   (9)

The step size is set to γ_{k+1} = 1 for the iEM method; γ_{k+1} = γ is a constant for the sEM-VR method. In the original version of the sEM method, the sequence of step sizes {γ_{k+1}} is diminishing. Moreover, for iEM we initialize with S^(0) = s̄^(0); for sEM-VR, we set an epoch size m and define ℓ(k) := m⌊k/m⌋ as the first iteration number of the epoch that iteration k belongs to.

fiEM  Our analysis framework can handle a new, yet natural, application of a popular variance reduction technique to the EM method. The new method, called fiEM, is developed from the SAGA method [Defazio et al., 2014] in a similar vein as sEM-VR.
For iteration k ≥ 0, the fiEM method draws two indices independently and uniformly as i_k, j_k ∈ ⟦1, n⟧. In addition to τ_i^k, which was defined w.r.t.
i_k, we define t_j^k = max{k' : j_{k'} = j, k' < k} to be the index of the iteration at which the sample j ∈ ⟦1, n⟧ was last drawn as j_k prior to iteration k. With the initialization S̲^(0) = s̄^(0), we use a slightly different update rule from SAGA, inspired by [Reddi et al., 2016b], as described by the following recursive updates:

S^(k+1) = S̲^(k) + (s̄_{i_k}^(k) − s̄_{i_k}^(t_{i_k}^k)),  S̲^(k+1) = S̲^(k) + n^{-1}(s̄_{j_k}^(k) − s̄_{j_k}^(t_{j_k}^k)),   (10)

where we set a constant step size γ_{k+1} = γ. In the above, the update of S^(k+1) is an unbiased estimate of s̄^(k), while the update of S̲^(k+1) maintains the structure S̲^(k) = n^{-1} Σ_{i=1}^n s̄_i^(t_i^k) for any k ≥ 0. The two updates of (10) are based on two different and independent indices i_k, j_k that are randomly drawn from ⟦n⟧. This is used for our fast convergence analysis in Section 3.
We summarize the iEM, sEM-VR, sEM, and fiEM methods in Algorithm 1. The random termination number (11) is inspired by [Ghadimi and Lan, 2013]; it enables one to show non-asymptotic convergence to a stationary point for non-convex optimization. Due to their stochastic nature, the per-iteration complexity of all the stochastic EM methods is independent of n, unlike the bEM method. They are thus applicable to large datasets with n ≫ 1.

2.1 Example: Gaussian Mixture Model

Algorithm 1 Stochastic EM methods.
1: Input: initializations θ̂^(0) and ŝ^(0) ← s̄^(0); K_max ← max. iteration number.
2: Set the terminating iteration number, K ∈ {0, . . . , K_max − 1}, as a discrete r.v. with:

P(K = k) = γ_k / Σ_{ℓ=0}^{K_max−1} γ_ℓ.   (11)

3: for k = 0, 1, 2, . . .
, K do
4:  Draw index i_k ∈ ⟦1, n⟧ uniformly (and j_k ∈ ⟦1, n⟧ for fiEM).
5:  Compute the surrogate sufficient statistics S^(k+1) using (7), (8), (9), or (10).
6:  Compute ŝ^(k+1) via the sE-step (4).
7:  Compute θ̂^(k+1) via the M-step (5).
8: end for
9: Return: θ̂^(K).

We discuss an example of learning a Gaussian Mixture Model (GMM) from a set of n observations {y_i}_{i=1}^n. We focus on a simplified setting where there are M components of unit variance and unknown means; the GMM is parameterized by θ = ({ω_m}_{m=1}^{M−1}, {μ_m}_{m=1}^M) ∈ Θ = Δ_M × ℝ^M, where Δ_M ⊆ ℝ^{M−1} is the reduced M-dimensional probability simplex [see (29)]. We use the penalization R(θ) = (δ/2) Σ_{m=1}^M μ_m² − log Dir(ω; M, ε), where δ > 0 and Dir(·; M, ε) is the M-dimensional symmetric Dirichlet distribution with concentration parameter ε > 0. Furthermore, we use z_i ∈ ⟦M⟧ as the latent label. The complete-data log-likelihood is given by

log f(z_i, y_i; θ) = Σ_{m=1}^M 1{m = z_i} [log(ω_m) − μ_m²/2] + Σ_{m=1}^M 1{m = z_i} μ_m y_i + constant,   (12)

where 1{m = z_i} = 1 if m = z_i and 1{m = z_i} = 0 otherwise.
The above can be rewritten in the same form as (2), particularly with S(y_i, z_i) ≡ (s_{i,1}^(1), ..., s_{i,M−1}^(1), s_{i,1}^(2), ..., s_{i,M−1}^(2), s_i^(3)) and φ(θ) ≡ (φ_1^(1)(θ), ..., φ_{M−1}^(1)(θ), φ_1^(2)(θ), ..., φ_{M−1}^(2)(θ), φ^(3)(θ)) such that

s_{i,m}^(1) = 1{z_i = m},  φ_m^(1)(θ) = {log(ω_m) − μ_m²/2} − {log(1 − Σ_{j=1}^{M−1} ω_j) − μ_M²/2},
s_{i,m}^(2) = 1{z_i = m} y_i,  φ_m^(2)(θ) = μ_m − μ_M,
s_i^(3) = y_i,  φ^(3)(θ) = μ_M,   (13)

and ψ(θ) = −{log(1 − Σ_{m=1}^{M−1} ω_m) − μ_M²/2}. To evaluate the sE-step, the conditional expectation required by (6) can be computed in closed form, as it depends only on E_{θ̂^(k)}[1{z_i = m} | y = y_i] and E_{θ̂^(k)}[y_i 1{z_i = m} | y = y_i]. Moreover, the M-step (5) solves a strongly convex problem and can be computed in closed form. Given a sufficient statistic s ≡ (s^(1), s^(2), s^(3)), the solution to (5) is:

θ̄(s) = ( (1 + εM)^{-1} (s_1^(1) + ε, ..., s_{M−1}^(1) + ε)ᵀ ,
        ( (s_1^(1) + δ)^{-1} s_1^(2), ..., (s_{M−1}^(1) + δ)^{-1} s_{M−1}^(2) )ᵀ ,
        ( 1 − Σ_{m=1}^{M−1} s_m^(1) + δ )^{-1} ( s^(3) − Σ_{m=1}^{M−1} s_m^(2) ) ).   (14)

The next section presents the main results of this paper on the convergence of stochastic EM methods. We shall use the above GMM example to illustrate the required assumptions.

3 Global Convergence of Stochastic EM Methods

We establish non-asymptotic rates for the global convergence of the stochastic EM methods.
We show that the iEM method is an instance of the incremental MM method, while the sEM-VR and fiEM methods are instances of variance-reduced stochastic scaled-gradient methods. As we will see, the latter interpretation allows us to establish fast convergence rates for the sEM-VR and fiEM methods. Detailed proofs of the theoretical results in this section are relegated to the appendix.
First, we list a few assumptions which enable the convergence analysis performed later in this section. Define:

S := { Σ_{i=1}^n α_i s_i : s_i ∈ conv{S(z, y_i) : z ∈ Z}, α_i ∈ [−1, 1], i ∈ ⟦1, n⟧ },   (15)

where conv{A} denotes the closed convex hull of the set A. From (15), we observe that the iEM, sEM-VR, and fiEM methods generate ŝ^(k) ∈ S for any k ≥ 0. Consider:

H1. The sets Z, S are compact. There exist constants C_S, C_Z such that:

C_S := max_{s,s'∈S} ‖s − s'‖ < ∞,  C_Z := max_{i∈⟦1,n⟧} ∫_Z |S(z, y_i)| μ(dz) < ∞.   (16)

H1 depends on the latent data model used and is satisfied by several practical models. For instance, the GMM in Section 2.1 satisfies (16), as the sufficient statistics are composed of indicator functions and observations. Other examples can be found in Section 4. Denote by J_θ^κ(θ') the Jacobian of the function κ : θ ↦ κ(θ) at θ' ∈ Θ. Consider:

H2. The function φ is smooth and bounded on int(Θ), the interior of Θ. For all θ, θ' ∈ int(Θ)², ‖J_θ^φ(θ) − J_θ^φ(θ')‖ ≤ L_φ ‖θ − θ'‖ and ‖J_θ^φ(θ')‖ ≤ C_φ.

H3. The conditional distribution is smooth on int(Θ).
For any i ∈ ⟦1, n⟧, z ∈ Z, and θ, θ' ∈ int(Θ)², we have |p(z | y_i; θ) − p(z | y_i; θ')| ≤ L_p ‖θ − θ'‖.

H4. For any s ∈ S, the function θ ↦ L(s, θ) := R(θ) + ψ(θ) − ⟨s | φ(θ)⟩ admits a unique global minimum θ̄(s) ∈ int(Θ). In addition, J_θ^φ(θ̄(s)) is full rank and θ̄(s) is L_θ-Lipschitz.

Under H1, the assumptions H2 and H3 are standard for curved exponential family distributions and conditional probability distributions, respectively; H4 can be enforced by designing a strongly convex regularization function R(θ) tailor-made for Θ. For instance, the penalization for the GMM in Section 2.1 ensures that θ̂^(k) is unique and lies in int(Δ_M) × ℝ^M, which further implies the second statement in H4. We remark that, for H3, it is possible to define the Lipschitz constant L_p independently for each datum y_i to yield a refined characterization. We did not pursue such an assumption, to keep the notation simple.
Denote by H_θ^L(s, θ) the Hessian w.r.t. θ, for a given value of s, of the function θ ↦ L(s, θ) = R(θ) + ψ(θ) − ⟨s | φ(θ)⟩, and define

B(s) := J_θ^φ(θ̄(s)) ( H_θ^L(s, θ̄(s)) )^{-1} J_θ^φ(θ̄(s))ᵀ.   (17)

H5. It holds that υ_max := sup_{s∈S} ‖B(s)‖ < ∞ and 0 < υ_min := inf_{s∈S} λ_min(B(s)). There exists a constant L_B such that, for all s, s' ∈ S², ‖B(s) − B(s')‖ ≤ L_B ‖s − s'‖.

Again, H5 is satisfied by practical models. For the GMM in Section 2.1, it can be verified by deriving the closed-form expression of B(s) and using H1; see also the other example in Section 4.
The derivation is, however, technical and is relegated to the supplementary material.
Under H1, we have ‖ŝ^(k)‖ < ∞, since S is compact. On the other hand, under H4, the EM methods generate θ̂^(k) ∈ int(Θ) for any k ≥ 0. Overall, these assumptions ensure that the EM methods operate in a 'nice' set throughout the optimization process.

3.1 Incremental EM method

We show that the iEM method is a special case of the MISO method [Mairal, 2015], which relies on the majorization-minimization (MM) technique, a common technique for handling non-convex optimization. We begin by defining a surrogate function that majorizes L_i:

Q_i(θ; θ') := −∫_Z {log f(z_i, y_i; θ) − log p(z_i | y_i; θ')} p(z_i | y_i; θ') μ(dz_i).   (18)

The second term inside the bracket is a constant that does not depend on the first argument θ. Since f(z_i, y_i; θ) = p(z_i | y_i; θ) g(y_i; θ), for all θ' ∈ Θ we get Q_i(θ'; θ') = −log g(y_i; θ') = L_i(θ'). For all θ, θ' ∈ Θ, applying Jensen's inequality shows

Q_i(θ, θ') − L_i(θ) = ∫_Z log( p(z_i | y_i; θ') / p(z_i | y_i; θ) ) p(z_i | y_i; θ') μ(dz_i) ≥ 0,   (19)

which is the Kullback-Leibler divergence between the conditional distributions of the latent data, p(· | y_i; θ') and p(· | y_i; θ). Hence, for all i ∈ ⟦1, n⟧, Q_i(θ; θ') is a majorizing surrogate of L_i(θ), i.e., it satisfies Q_i(θ; θ') ≥ L_i(θ) for all θ, θ' ∈ Θ, with equality when θ = θ'.
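The majorization property (19) can be checked numerically on a single observation of the unit-variance GMM from Section 2.1; the sketch below (function names ours) evaluates L_i and the surrogate Q_i directly:

```python
import numpy as np

def neg_loglik_i(y, omega, mu):
    """L_i(theta) = -log g(y_i; theta) for a unit-variance GMM."""
    g = np.sum(omega * np.exp(-0.5 * (y - mu) ** 2)) / np.sqrt(2 * np.pi)
    return -np.log(g)

def surrogate_i(y, omega, mu, omega0, mu0):
    """Q_i(theta; theta') from (18), with theta' = (omega0, mu0)."""
    dens0 = omega0 * np.exp(-0.5 * (y - mu0) ** 2) / np.sqrt(2 * np.pi)
    resp0 = dens0 / dens0.sum()              # p(z_i | y_i; theta')
    log_f = np.log(omega) - 0.5 * (y - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return -np.sum(resp0 * (log_f - np.log(resp0)))

y = 0.3
omega, mu = np.array([0.4, 0.6]), np.array([-0.5, 0.5])
omega0, mu0 = np.array([0.7, 0.3]), np.array([-1.0, 1.0])
gap = surrogate_i(y, omega, mu, omega0, mu0) - neg_loglik_i(y, omega, mu)
```

The gap is the KL divergence in (19): it is non-negative and vanishes at θ = θ'.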
For the special case of the curved exponential family distribution, the M-step of the iEM method is expressed as

θ̂^(k+1) ∈ arg min_{θ∈Θ} { R(θ) + n^{-1} Σ_{i=1}^n Q_i(θ; θ̂^(τ_i^{k+1})) }
        = arg min_{θ∈Θ} { R(θ) + ψ(θ) − ⟨ n^{-1} Σ_{i=1}^n s̄_i^(τ_i^{k+1}) | φ(θ) ⟩ }.   (20)

The iEM method can be interpreted through the MM technique – in the M-step, θ̂^(k+1) minimizes an upper bound of ℒ(θ), while the sE-step updates the surrogate function in (20), which tightens the upper bound. Importantly, the error between the surrogate function and L_i is a smooth function:

Lemma 1. Assume H1, H2, H3, H4. Let e_i(θ; θ') := Q_i(θ; θ') − L_i(θ). For any θ, θ̄, θ' ∈ Θ³, we have ‖∇e_i(θ; θ') − ∇e_i(θ̄; θ')‖ ≤ L_e ‖θ − θ̄‖, where L_e := C_φ C_Z L_p + C_S L_φ.

For non-convex optimization problems such as (1), it has been shown [Mairal, 2015, Proposition 3.1] that the incremental MM method converges asymptotically to a stationary solution of the problem. We strengthen this result by establishing a non-asymptotic rate, which is new to the literature.

Theorem 1. Consider the iEM algorithm, i.e., Algorithm 1 with (7). Assume H1, H2, H3, H4. For any K_max ≥ 1, it holds that

E[‖∇ℒ(θ̂^(K))‖²] ≤ (2 n L_e / K_max) E[ℒ(θ̂^(0)) − ℒ(θ̂^(K_max))],   (21)

where L_e is defined in Lemma 1 and K is a uniform random variable on ⟦0, K_max − 1⟧ [cf.
(11)] independent of {i_k}_{k=0}^{K_max}.
We remark that, under suitable assumptions, our analysis in Theorem 1 also extends to several non-exponential family distribution models.

3.2 Stochastic EM as Scaled Gradient Methods

We interpret the sEM-VR and fiEM methods as scaled gradient methods on the sufficient statistics ŝ, tackling a non-convex optimization problem. The benefit of doing so is that we can demonstrate a faster convergence rate for these methods by casting them as variance-reduced optimization methods; the latter have been shown to be more effective than traditional stochastic/deterministic optimization methods when handling large datasets [Reddi et al., 2016b,a, Allen-Zhu and Hazan, 2016]. To set the stage, we consider the minimization problem:

min_{s∈S} V(s) := ℒ(θ̄(s)) = R(θ̄(s)) + (1/n) Σ_{i=1}^n L_i(θ̄(s)),   (22)

where θ̄(s) is the unique map defined in the M-step (5). We first show that the stationary points of (22) are in one-to-one correspondence with the stationary points of (1):

Lemma 2. For any s ∈ S, it holds that

∇_s V(s) = J_s^θ̄(s)ᵀ ∇_θ ℒ(θ̄(s)).   (23)

Assume H4. If s* ∈ {s ∈ S : ∇_s V(s) = 0}, then θ̄(s*) ∈ {θ ∈ Θ : ∇_θ ℒ(θ) = 0}. Conversely, if θ* ∈ {θ ∈ Θ : ∇_θ ℒ(θ) = 0}, then s* = s̄(θ*) ∈ {s ∈ S : ∇_s V(s) = 0}.
The next lemmas show that the update direction ŝ^(k) − S^(k+1) in the sE-step (4) of the sEM-VR and fiEM methods is a scaled gradient of V(s).
We \ufb01rst observe the following conditional expectation:\n(24)\n\nwhere Fk is the \u03c3-algebra generated by {i0, i1, . . . , ik} (or {i0, j0, . . . , ik, jk} for \ufb01EM).\nThe difference vector s \u2212 s(\u03b8(s)) and the gradient vector \u2207sV (s) are correlated, as we observe:\nLemma 3. Assume H4,H5. For all s \u2208 S,\n\nE(cid:2)\u02c6s(k) \u2212 S (k+1)|Fk\n(cid:10)\u2207V (s)| s \u2212 s(\u03b8(s))(cid:11) \u2265(cid:13)(cid:13)s \u2212 s(\u03b8(s))(cid:13)(cid:13)2 \u2265 \u03c5\u22122\n\nmax(cid:107)\u2207V (s)(cid:107)2,\n\n\u03c5\u22121\n\nmin\n\n(25)\n\nCombined with (24), the above lemma shows that the update direction in the sE-step (4) of sEM-VR\nand \ufb01EM methods is a stochastic scaled gradient where \u02c6s(k) is updated with a stochastic direction\nwhose mean is correlated with \u2207V (s).\nFurthermore, the expectation step\u2019s operator and the objective function in (22) are smooth functions:\n\nLemma 4. Assume H1, H3, H4, H5. For all s, s(cid:48) \u2208 S and i \u2208(cid:74)1, n(cid:75), we have\n\n(cid:107)si(\u03b8(s)) \u2212 si(\u03b8(s(cid:48)))(cid:107) \u2264 Ls (cid:107)s \u2212 s(cid:48)(cid:107), (cid:107)\u2207V (s) \u2212 \u2207V (s(cid:48))(cid:107) \u2264 LV (cid:107)s \u2212 s(cid:48)(cid:107),\n\n(26)\n\nwhere Ls := CZ Lp L\u03b8 and LV := \u03c5max\nThe following theorem establishes the (fast) non-asymptotic convergence rates of sEM-VR and \ufb01EM\nmethods, which are similar to [Reddi et al., 2016b,a, Allen-Zhu and Hazan, 2016]:\nTheorem 2. Assume H1, H3, H4, H5. Denote Lv = max{LV , Ls} with the constants in Lemma 4.\n\u2022 Consider the sEM-VR method, i.e., Algorithm 1 with (9). 
There exists a universal constant μ ∈ (0, 1) (independent of n) such that, if we set the step size as γ = μ υ_min / (L_v n^{2/3}) and the epoch length as m = n / (2μ² υ_min² + μ), then for any K_max ≥ 1 that is a multiple of m, it holds that

E[‖∇V(ŝ^(K))‖²] ≤ n^{2/3} (2 L_v / (μ K_max)) (υ_max²/υ_min²) E[V(ŝ^(0)) − V(ŝ^(K_max))].   (27)

• Consider the fiEM method, i.e., Algorithm 1 with (10). Set γ = υ_min / (α L_v n^{2/3}) with α = max{6, 1 + 4υ_min}. For any K_max ≥ 1, it holds that

E[‖∇V(ŝ^(K))‖²] ≤ n^{2/3} (α² L_v / K_max) (υ_max²/υ_min²) E[V(ŝ^(0)) − V(ŝ^(K_max))].   (28)

We recall that K in the above is a uniform and independent r.v. chosen from ⟦0, K_max − 1⟧ [cf. (11)].
In the supplementary material, we also provide a local convergence analysis for the fiEM method, which shows that it can achieve a linear rate of convergence locally under a similar set of assumptions to those used in [Chen et al., 2018] for the sEM-VR method.

Comparing iEM, sEM-VR, and fiEM  Note that, by (23) in Lemma 2, if ‖∇_s V(ŝ)‖² ≤ ε, then ‖∇_θ ℒ(θ̄(ŝ))‖² = O(ε), and vice versa, where the hidden constant is independent of n. In other words, the rates for the iEM, sEM-VR, and fiEM methods in Theorems 1 and 2 are comparable.
Importantly, the theorems reveal an intriguing comparison – to attain an ε-stationary point with ‖∇_θ ℒ(θ̄(ŝ))‖² ≤ ε or ‖∇_s V(ŝ)‖² ≤ ε, the iEM method requires O(n/ε) iterations (in expectation), while the sEM-VR and fiEM methods require only O(n^{2/3}/ε) iterations (in expectation).
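Ignoring constants, the gap between the two iteration counts grows like n^{1/3}; a back-of-the-envelope reading of the bounds:

```python
def iters_iem(n, eps):
    """Iteration count of iEM up to constants: O(n / eps), Theorem 1."""
    return n / eps

def iters_vr(n, eps):
    """Iteration count of sEM-VR / fiEM up to constants: O(n^(2/3) / eps)."""
    return n ** (2.0 / 3.0) / eps

# at n = 10^6, the variance-reduced methods need about n^(1/3) = 100 times
# fewer iterations for the same target accuracy eps
ratio = iters_iem(10**6, 0.1) / iters_vr(10**6, 0.1)
```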
This comparison may be surprising, since the iEM method is monotone (it guarantees a decrease of
the objective value at each iteration) while the sEM-VR and fiEM methods are non-monotone.
Nevertheless, it aligns with recent analyses of stochastic variance-reduced methods for non-convex
problems. In the next section, we confirm the theory by observing a similar behavior numerically.

4 Numerical Examples

4.1 Gaussian Mixture Models

As described in Section 2.1, our goal is to fit a GMM to a set of n observations {y_i}_{i=1}^n
whose distribution is modeled as a Gaussian mixture of M components, each with unit variance. Let
z_i ∈ ⟦M⟧ be the latent labels; the complete log-likelihood is given in (12), where θ := (ω, µ),
ω = {ω_m}_{m=1}^{M−1} are the mixing weights with the convention ω_M = 1 − Σ_{m=1}^{M−1} ω_m, and
µ = {µ_m}_{m=1}^{M} are the means. The constraint set on θ is given by

    Θ = {ω_m, m = 1, ..., M−1 : ω_m ≥ 0, Σ_{m=1}^{M−1} ω_m ≤ 1} × {µ_m ∈ R, m = 1, ..., M}.   (29)

In the following experiments on synthetic data, we generate samples from a GMM with M = 2
components with means µ_1 = −µ_2 = 0.5; see Appendix G.1 for details of the implementation and for
the verification of the model assumptions for GMM inference. We aim at verifying the dependence on
the sample size n predicted by the theoretical results in Theorems 1 and 2.

Fixed sample size   We use n = 10^4 synthetic samples and run the bEM method until convergence
(to double precision) to obtain the ML estimate µ*. We compare the bEM, sEM, iEM, sEM-VR and fiEM
methods in terms of their precision, measured by |µ − µ*|².
We set the stepsize of the sEM method to γ_k = 3/(k + 10), and the stepsizes of the sEM-VR and fiEM
methods to a constant proportional to 1/n^{2/3}, equal to γ = 0.003. The left plot of Figure 1
shows the convergence of the precision |µ(k) − µ*|² for the different methods against the number of
epochs elapsed (one epoch equals n iterations). We observe that the sEM-VR and fiEM methods
outperform the other methods, supporting our analytical results.

Varying sample size   We compare the number of iterations required to reach a precision of 10^{−3}
as a function of the sample size, from n = 10^3 to n = 10^5. We average over 5 independent runs for
each method, using the same stepsizes as in the fixed sample size case above. The right plot of
Figure 1 confirms that our findings in Theorems 1 and 2 are sharp: it requires O(n/ε)
(resp. O(n^{2/3}/ε)) iterations to find an ε-stationary point for the iEM (resp. sEM-VR and fiEM)
method.

Figure 1: Performance of stochastic EM methods for fitting a GMM. (Left) Precision (|µ(k) − µ*|²)
as a function of the number of epochs elapsed. (Right) Number of iterations to reach a precision of
10^{−3}.

4.2 Probabilistic Latent Semantic Analysis

The second example considers probabilistic Latent Semantic Analysis (pLSA), whose aim is to
classify documents into a number of topics. We are given a collection of documents ⟦D⟧ with terms
from a vocabulary ⟦V⟧. The data is summarized by a list of tokens {y_i}_{i=1}^n, where each token
is a pair of document and word y_i = (y_i^{(d)}, y_i^{(w)}), which indicates that the word y_i^{(w)}
appears in document y_i^{(d)}. The goal of pLSA is to classify the documents into K topics, which
are modeled through a latent variable z_i ∈ ⟦K⟧ associated with each token [Hofmann, 1999].

To apply the stochastic EM methods to pLSA, we define θ := (θ(t|d), θ(w|t)) as the parameter
variable, where θ(t|d) = {θ(t|d)_{d,k}}_{⟦K−1⟧×⟦D⟧} and θ(w|t) = {θ(w|t)_{k,v}}_{⟦K⟧×⟦V−1⟧}. The
constraint set Θ is given as follows: for each d ∈ ⟦D⟧, θ(t|d)_{d,·} ∈ Δ_K, and for each k ∈ ⟦K⟧,
θ(w|t)_{·,k} ∈ Δ_V, where Δ_K, Δ_V are the (reduced dimension) K- and V-dimensional probability
simplices; see (108) in the supplementary material for the precise definition. Furthermore,
denoting θ(t|d)_{d,K} = 1 − Σ_{k=1}^{K−1} θ(t|d)_{d,k} for each d ∈ ⟦D⟧ and
θ(w|t)_{k,V} = 1 − Σ_{ℓ=1}^{V−1} θ(w|t)_{k,ℓ} for each k ∈ ⟦K⟧, the complete log-likelihood for
(y_i, z_i) is (up to an additive constant term):

    log f(z_i, y_i; θ) = Σ_{d=1}^{D} Σ_{k=1}^{K} log(θ(t|d)_{d,k}) 1_{k,d}(z_i, y_i^{(d)})
                       + Σ_{k=1}^{K} Σ_{v=1}^{V} log(θ(w|t)_{k,v}) 1_{k,v}(z_i, y_i^{(w)}).   (30)

The penalization function is designed as

    R(θ(t|d), θ(w|t)) = − log Dir(θ(t|d); K, α′) − log Dir(θ(w|t); V, β′),                     (31)

which ensures that θ(s) ∈ int(Θ). We can apply the stochastic EM methods described in Section 2 to
the pLSA problem.
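For intuition, the E-step for a single token reduces to a posterior over topics. The sketch below illustrates this computation; the names `theta_td`, `theta_wt` and `e_step_token` are ours, and the parameters are random placeholders on the simplices rather than fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, K = 5, 30, 3   # toy numbers of documents, vocabulary terms and topics

# Parameters drawn on the probability simplices (each row sums to one):
theta_td = rng.dirichlet(np.ones(K), size=D)   # theta^(t|d): D x K, topic given document
theta_wt = rng.dirichlet(np.ones(V), size=K)   # theta^(w|t): K x V, word given topic

def e_step_token(d, w):
    """Posterior of the latent topic for one token y_i = (d, w):
    p(z_i = k | y_i; theta) is proportional to theta^(t|d)_{d,k} * theta^(w|t)_{k,w}."""
    post = theta_td[d] * theta_wt[:, w]
    return post / post.sum()

posterior = e_step_token(d=0, w=7)   # length-K vector summing to one
```

In the incremental and variance-reduced methods of Section 2, only the sufficient statistics attached to the sampled token (the posterior-weighted indicator counts appearing in (30)) are refreshed at each iteration, instead of recomputing this posterior for all n tokens.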
The implementation details are provided in Appendix G.2, where we also verify the model assumptions
required by our convergence analysis for pLSA.

Experiment   We compare the stochastic EM methods on two FAO (UN Food and Agriculture
Organization) datasets [Medelyan, 2009]. The first (resp. second) dataset consists of 10^3
(resp. 10.5 × 10^3) documents and a vocabulary of size 300. The number of topics is set to K = 10;
the stepsizes of the fiEM and sEM-VR methods are set to γ = 1/n^{2/3}, while the stepsize of the
sEM method is set to γ_k = 1/(k + 10). Figure 2 shows the evidence lower bound (ELBO) as a function
of the number of epochs for the two datasets. Again, the results show that the fiEM and sEM-VR
methods achieve faster convergence than the competing EM methods, affirming our theoretical
findings.

Figure 2: ELBO of the stochastic EM methods on FAO datasets as a function of the number of epochs
elapsed. (Left) Small dataset with 10^3 documents. (Right) Large dataset with 10.5 × 10^3
documents.

5 Conclusion

This paper studies the global convergence of stochastic EM methods. In particular, we focus on the
inference of latent variable models with exponential family distributions and analyze the
convergence of several stochastic EM methods. Our convergence results are global and
non-asymptotic, and we offer two complementary views on the existing stochastic EM methods: one
interprets the iEM method as an incremental MM method, and the other interprets the sEM-VR and fiEM
methods as scaled gradient methods. The analysis shows that the sEM-VR and fiEM methods converge
faster than the iEM method, and this result is confirmed via numerical experiments.

Acknowledgement

BK and HTW contributed equally to this work. HTW's work is supported by the CUHK Direct Grant
#4055113.

References

P. Ablin, A. Gramfort, J.-F. Cardoso, and F. Bach. EM algorithms for ICA. arXiv preprint
arXiv:1805.10054, 2018.

Z.
Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In International
Conference on Machine Learning, pages 699–707, 2016.

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From
population to sample-based analysis. Annals of Statistics, 45(1):77–120, 2017.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians.
Journal of the American Statistical Association, 112(518):859–877, 2017.

O. Cappé and E. Moulines. On-line expectation–maximization algorithm for latent data models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.

J. Chen, J. Zhu, Y. W. Teh, and T. Zhang. Stochastic expectation maximization with variance
reduction. In Advances in Neural Information Processing Systems, pages 7978–7988, 2018.

I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures.
Statistics & Decisions, suppl. 1:205–237, 1984.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support
for non-strongly convex composite objectives. In Advances in Neural Information Processing
Systems, pages 1646–1654, 2014.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society: Series B (Methodological), pages 1–38, 1977.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic
programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization
procedures.
Journal of Machine Learning Research, 6:2049–2073, 2005.

G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural
Computation, 18(7):1527–1554, 2006.

T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance
reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

B. Karimi, B. Miasojedow, E. Moulines, and H.-T. Wai. Non-asymptotic analysis of biased stochastic
approximation schemes. In Conference on Learning Theory, 2019.

J. Mairal. Incremental majorization-minimization optimization with application to large-scale
machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

G. McLachlan and T. Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons,
2007.

O. Medelyan. Human-competitive automatic topic indexing. PhD thesis, The University of Waikato,
2009.

R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and
other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.

S. Ng and G. McLachlan. On the choice of the number of blocks with the incremental EM algorithm
for the fitting of normal mixtures. Statistics and Computing, 13(1):45–55, 2003.

S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. Smola. Stochastic variance reduction for
nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016a.

S. J. Reddi, S. Sra, B. Póczos, and A. Smola.
Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159, 2016b.

B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Machine Learning,
45(3):279–299, 2001.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational
inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Z. Wang, Q. Gu, Y. Ning, and H. Liu. High dimensional EM algorithm: Statistical optimization and
asymptotic normality. In Advances in Neural Information Processing Systems 28, pages 2521–2529.
Curran Associates, Inc., 2015.

C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics,
11(1):95–103, 1983.

J. Xu, D. J. Hsu, and A. Maleki. Global analysis of expectation maximization for mixtures of two
Gaussians. In Advances in Neural Information Processing Systems 29, pages 2676–2684. Curran
Associates, Inc., 2016.

R. Zhu, L. Wang, C. Zhai, and Q. Gu. High-dimensional variance-reduced stochastic gradient
expectation-maximization algorithm. In Proceedings of the 34th International Conference on Machine
Learning, volume 70, pages 4180–4188. JMLR.org, 2017.