{"title": "Global Convergence of Least Squares EM for Demixing Two Log-Concave Densities", "book": "Advances in Neural Information Processing Systems", "page_first": 4794, "page_last": 4802, "abstract": "This work studies the location estimation problem for a mixture of two rotation invariant log-concave densities. We demonstrate that Least Squares EM, a variant of the EM algorithm, converges to the true location parameter from a randomly initialized point. Moreover, we establish the explicit convergence rates and sample complexity bounds, revealing their dependence on the signal-to-noise ratio and the tail property of the log-concave distributions. Our analysis generalizes previous techniques for proving the convergence results of Gaussian mixtures, and highlights that an angle-decreasing property is sufficient for establishing global convergence for Least Squares EM.", "full_text": "Global Convergence of Least Squares EM\nfor Demixing Two Log-Concave Densities\n\nWei Qian, Yuqian Zhang, Yudong Chen\n\nSchool of Operations Research and Information Engineering\n\nCornell University\n\n{wq34,yz2557,yudong.chen}@cornell.edu\n\nAbstract\n\nThis work studies the location estimation problem for a mixture of two rotation\ninvariant log-concave densities. We demonstrate that Least Squares EM, a variant\nof the EM algorithm, converges to the true location parameter from a randomly\ninitialized point. Moreover, we establish the explicit convergence rates and sample\ncomplexity bounds, revealing their dependence on the signal-to-noise ratio and the\ntail property of the log-concave distributions. Our analysis generalizes previous\ntechniques for proving the convergence results of Gaussian mixtures, and highlights\nthat an angle-decreasing property is suf\ufb01cient for establishing global convergence\nfor Least Squares EM.\n\n1\n\nIntroduction\n\nOne important problem in statistics and machine learning [18, 24] is learning a \ufb01nite mixture of\ndistributions. 
In the parametric setting in which the functional form of the density is known, this\nproblem is to estimate parameters (e.g., mean and covariance) that specify the distribution of each\nmixture component. The parameter estimation problem for mixture models is inherently non-convex,\nposing challenges for both computation and analysis. While many algorithms have been proposed,\nrigorous performance guarantees are often elusive. One exception is the Gaussian Mixture Model\n(GMM) for which much theoretical progress has been made in recent years. The goal of this paper is\nto study algorithmic guarantees for a much broader class of mixture models, namely the mixture of\nlog-concave distributions. This class includes many common distributions (see footnote 1) and is interesting from\nboth modelling and theoretical perspectives [2, 3, 6, 12, 26, 23].\nWe focus on the Expectation Maximization (EM) algorithm [11], which is one of the most popular\nmethods for estimating mixture models. Understanding the convergence property of EM is highly\nnon-trivial due to the non-convexity of the negative log-likelihood function. The work in [4] developed\na general framework for establishing local convergence to the true parameter. Proving global\nconvergence of EM is more challenging, even in the simplest setting with a mixture of two Gaussians\n(2GMM). 
The recent work in [10, 28] considered balanced 2GMM with known covariance matrix and\nshowed for the first time that EM converges to the true location parameter using random initialization.\nSubsequent work established global convergence results for a mixture of two truncated Gaussians [19],\ntwo linear regressions (2MLR) [17, 16], and two one-dimensional Laplace distributions [5].\nAll the above results (with the exception of [5]) rely on the explicit density form and the specific\nproperties of the Gaussian distribution.\nIn particular, under the Gaussian assumption, the\nM-step in the EM algorithm has a closed-form expression, which allows a direct analysis of the\nconvergence behavior of the algorithm. However, for a general log-concave distribution the M-step\nno longer admits a closed-form solution, and this poses significant challenges for analysis. To address\nthis difficulty, we consider a modification of the standard EM algorithm, called Least Squares EM\n(LS-EM), to learn the location parameter of a mixture of two log-concave distributions. This algorithm\nadmits an explicit update, which is computationally simple.\n\nFootnote 1: Familiar examples of log-concave distributions include the Gaussian, Laplace, Gamma, and logistic distributions [3].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAs the main result of this paper, we show that for a mixture of rotation invariant log-concave\ndistributions, LS-EM converges to the true location parameter with a random initialization. Moreover,\nwe provide explicit convergence rates and sample complexity bounds, which depend on the signal-to-\nnoise ratio as well as the tail property of the distribution. As the functional form of the true density\nmay be unknown, we further establish a robustness property of LS-EM when using a mis-specified\ndensity. 
As a special case, we show that using a Gaussian density, LS-EM globally converges to a\nsolution close to the true parameter whenever the variance of the true log-concave density is moderate.\n\nTechnical Contributions We generalize the sensitivity analysis in [10] to a broad class of\nlog-concave distributions. In the process, we demonstrate that log-concavity and rotation invariance of the\ndistribution are the only properties required to guarantee the global convergence of LS-EM. Moreover,\nour analysis highlights the fundamental role of an angle-decreasing property in establishing the\nconvergence of LS-EM to the true location parameter in high-dimensional settings. Note that the ℓ_2\ndistance contraction, upon which the previous convergence results were built, no longer holds for\ngeneral log-concave distributions.\n\nOrganization In Section 2, we describe the parameter estimation problem for a mixture of\nlog-concave distributions and review related work. In Section 3, we derive the LS-EM algorithm and\nelucidate its connection with classical EM. Analysis of the global convergence of LS-EM is provided\nin Section 4 for the population setting, in Section 5 for the finite-sample setting, and in Section 6 for\nthe model mis-specification setting. The paper concludes with a discussion of future directions in\nSection 7. Some details of the proofs and numerical experiments are deferred to the Appendix.\nNotations We use x ∈ R and x ∈ R^d to denote scalars and vectors, respectively; X ∈ R and\nX ∈ R^d to denote scalar and vector random variables, respectively. The i-th coordinate of x (or X)\nis x_i (or X_i), and the j-th data point is denoted by x^j or X^j. The Euclidean norm in R^d is ‖·‖_2.\nFor two vectors α, β ∈ R^d, we use ∠(α, β) ∈ (0, π) to denote the angle between them and ⟨α, β⟩\nto denote their inner product. 
Finally, I_d is the d-by-d identity matrix.\n\n2 Problem Setup\n\nIn this section, we set up the model for a mixture of log-concave distributions and discuss the\ncorresponding location estimation problem.\n\n2.1 Data Generating Model\nLet F be a class of rotation invariant log-concave densities in R^d defined as follows:\n\nF = { f : f(x) = (1/C_g) exp(-g(‖x‖_2)), g is convex and strictly increasing on [0, ∞),\n      ∫ f(x) dx = 1, ∫ x_i^2 f(x) dx = 1 for all i ∈ [d] },    (1)\n\nwhere we may assume g(0) = 0 without loss of generality (see footnote 2). It can be verified that each f ∈ F\nhas mean 0 and covariance matrix I_d. For each f ∈ F, we may generate a location-scale family\nconsisting of the densities f_{β,σ}(x) := (1/σ^d) f((x - β)/σ), which has mean β and covariance matrix σ^2 I_d.\nWe assume that each data point is generated from D(β*, σ), a balanced mixture of two symmetric\nmembers of the above log-concave location-scale family:\n\nD(β*, σ) := (1/2) f_{β*,σ} + (1/2) f_{-β*,σ}.    (2)\n\nFootnote 2: Note that x ↦ g(‖x‖_2) is a convex function, as it is the composition of a convex function and a convex\nincreasing function. The normalization constant C_g can be computed explicitly by C_g = C_h d v_d with\nC_h = ∫_0^∞ t^{d-1} exp(-g(t)) dt, where v_d := π^{d/2} / Γ(d/2 + 1) is the volume of a unit ball in R^d.\n\nThroughout this paper, we denote the signal-to-noise ratio (SNR) by\n\nη := ‖β*‖_2 / σ.\n\nIt is sometimes useful to view the above model as an equivalent latent variable model: for each i ∈ [n]\nan unobserved label Z_i ∈ {1, 2} is first generated according to P(Z = 1) = P(Z = 2) = 1/2, and\nthen the data point X^i is sampled from the corresponding mixture component, i.e., from f_{β*,σ} if\nZ_i = 1 and from f_{-β*,σ} otherwise.\nExamples: Below are several familiar examples of one-dimensional log-concave distributions\nf ∝ exp(-g) from F:\n\n1. Polynomial distributions: g(x) ∝ |x|^r with r ≥ 1. When r = 2, it corresponds to the\nGaussian distribution. When r = 1, it corresponds to the Laplace distribution.\n2. Logistic distribution: g(x) ∝ log(e^{|x|/2} + e^{-|x|/2}).\n\nThese distributions can be generalized to higher dimension by replacing |x| with ‖x‖_2. In Appendix B,\nwe provide a review of some elementary properties of log-concave distributions.\n\n2.2 Location Estimation and the EM Algorithm\nWe assume that σ is known, and our goal is to estimate the location parameter β* given data\nX^1, X^2, . . . , X^n ∈ R^d sampled i.i.d. from the mixture distribution D(β*, σ) in (2). We first\nconsider this problem for a given log-concave family for which the base density f (equivalently, g) is\nknown. The case with an unknown f is discussed in Section 6.\nSince the negative log-likelihood function of the mixture (2) is non-convex, computing the standard\nMLE for β* involves a non-convex optimization problem. EM is a popular iterative method for\ncomputing the MLE, consisting of an expectation (E) step and a maximization (M) step. In a standard\nimplementation of EM, the E-step computes the conditional distribution of the labels Z_i under\nthe current estimate of β*, and the M-step computes a new estimate by maximizing the expected\nlog-likelihood based on the E-step. The LS-EM algorithm we consider, described in Section 3 to\nfollow, is a variant of the standard EM algorithm with a modified M-step.\n\n2.3 Convergence of EM and Related Work\nDespite the popularity and empirical success of the EM algorithm, our understanding of its theoretical\nproperty is far from complete. Due to the non-convexity of negative log-likelihood functions, EM is\nonly guaranteed to converge to a stationary point [27]. Quantitative convergence results only began\nto emerge in recent years. The work [4] proposed a general framework for establishing the local\nconvergence of EM when initialized near the true parameter, with applications to 2GMM, 2MLR,\nand regression with missing coefficients. 
Extensions to multiple components are considered in [29].\nBeyond local convergence, it is known that the likelihood function of GMM may have bad local\noptima when there are more than two components, and EM fails to find the true parameter without a\ncareful initialization [14]. Analysis of the global convergence of EM has hence been focused on the\ntwo component setting, as is done in this paper. The work in [10] showed that EM converges from a\nrandom initialization for 2GMM. Subsequent work in [17, 16, 28, 19] established similar results in\nother settings, most of which involve Gaussian models. An exception is [5], which proved the global\nconvergence of EM for a mixture of 2 Laplace distributions and derived an explicit convergence rate,\nbut only in the one-dimensional population (infinite sample) setting. In general, properties of EM for\nmixtures of other distributions are much less understood, which is the problem we target in this\npaper.\nThe log-concave family we consider is a natural and flexible generalization of Gaussian. This family\nincludes many common distributions, and has broad applications in economics [2, 3], reliability\ntheory [6] and sampling analysis [12]; see [26, 23] for a further review. Existing work on estimating\nlog-concave distributions and mixtures has mostly considered the non-parametric setting [26, 9,\n15, 21]; these methods are flexible but typically more computationally and data intensive than\nthe parametric approach we consider. Other approaches to learning general mixtures include\nspectral [1, 22] and tensor methods [13, 8]; EM is often applied to the output of these methods.\n\n3 The Least Squares EM Algorithm (LS-EM)\n\nThe M-step in the standard EM involves maximizing the conditional log-likelihood. For GMM, the\nM-step is equivalent to solving a least-squares problem. 
For a mixture of log-concave distributions,\nthe M-step is to solve a convex optimization problem, and this optimization problem does not admit a\nclosed form solution in general. This introduces complexity in both computation and analysis.\nWe instead consider Least Squares EM, a variant of EM that solves a least-squares problem in the\nM-step even for non-Gaussian mixtures. To elucidate the algorithmic property, we consider LS-EM\nin the population setting, where we have access to an infinite number of data points from the mixture\ndistribution D(β*, σ). The finite sample version is discussed in Section 5.\nEach iteration of the population LS-EM algorithm consists of the following two steps:\n• E-step: Compute the conditional probabilities of the label Z ∈ {1, 2} given the current location\nestimate β:\n\np^1_{β,σ}(X) := f_{β,σ}(X) / (f_{β,σ}(X) + f_{-β,σ}(X)),    p^2_{β,σ}(X) := f_{-β,σ}(X) / (f_{β,σ}(X) + f_{-β,σ}(X)).    (3)\n\n• Least-squares M-step: Update the location estimate β via weighted least squares:\n\nβ^+ = argmin_b E_{X∼D(β*,σ)} [ p^1_{β,σ}(X) ‖X - b‖_2^2 + p^2_{β,σ}(X) ‖X + b‖_2^2 ]\n    = E_{X∼D(β*,σ)} X tanh( (1/2) g(‖X + β‖_2 / σ) - (1/2) g(‖X - β‖_2 / σ) ) =: M(β*, β).    (4)\n\nIn (4), we minimize the sum of squared distances of X to each component\u2019s location, weighted\nby the conditional probability of X belonging to that component. 
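
The two steps above can be sketched in a few lines of Python. This is a minimal one-dimensional, finite-sample illustration under stated assumptions, not the authors' implementation: the function names (ls_em_step, run_ls_em, gaussian_g, sample_gaussian_mixture), the Gaussian choice g(t) = t^2 / 2, and the values beta* = 2, sigma = 1, n = 5000 are all hypothetical and chosen only for the example. The expectation in (4) is replaced by an average over samples drawn from the mixture.

```python
import math
import random


def ls_em_step(beta, xs, g, sigma=1.0):
    # One LS-EM update: the E-step weights and the least-squares M-step
    # collapse into a single average of x * tanh(F / 2), as in (4).
    total = 0.0
    for x in xs:
        # F = g(|x + beta| / sigma) - g(|x - beta| / sigma)
        f_val = g(abs(x + beta) / sigma) - g(abs(x - beta) / sigma)
        # p1 - p2 = tanh(F / 2) is the signed soft assignment of x
        total += x * math.tanh(0.5 * f_val)
    return total / len(xs)


def run_ls_em(beta0, xs, g, sigma=1.0, iters=50):
    # Iterate the update from the initial point beta0.
    beta = beta0
    for _ in range(iters):
        beta = ls_em_step(beta, xs, g, sigma)
    return beta


def gaussian_g(t):
    # Assumed base density: standard Gaussian, f(t) proportional to exp(-t^2 / 2).
    return 0.5 * t * t


def sample_gaussian_mixture(beta_star, sigma, n, rng):
    # Balanced mixture: pick a sign uniformly, then add scaled Gaussian noise.
    return [rng.choice((-1.0, 1.0)) * beta_star + sigma * rng.gauss(0.0, 1.0)
            for _ in range(n)]


rng = random.Random(0)
data = sample_gaussian_mixture(2.0, 1.0, 5000, rng)
estimate = run_ls_em(0.1, data, gaussian_g)
print(estimate)
```

Swapping gaussian_g for another convex increasing g, for instance lambda t: math.sqrt(2.0) * t for a unit-variance Laplace base density, changes only the weights; the least-squares form of the update stays the same.
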
One may interpret LS-EM as a\nsoft version of the K-means algorithm: instead of associating each X exclusively with one of the\ncomponents, we assign a probability, which is computed using the log-concave density.\n\n3.1 Connection to Standard EM\nIn contrast to LS-EM, the M-step in the standard EM algorithm involves maximizing the weighted\nlog-likelihood function (or minimizing the weighted negative log-likelihood function):\n\nStandard M-step:  argmax_b Q(b | β) := E_{X∼D(β*,σ)} [ p^1_{β,σ}(X) log f_{b,σ}(X) + p^2_{β,σ}(X) log f_{-b,σ}(X) ].    (5)\n\nThe standard EM iteration, consisting of (3) and (5), corresponds to a minorization-maximization\nprocedure for finding the MLE under the statistical setting (2). In particular, the function Q(· | β)\nabove is a lower bound of the (marginal) log-likelihood function of (2), and the standard\nM-step (5) finds the maximizer of this lower bound. In general, this maximization can only be solved\napproximately. For example, the \u201cgradient EM\u201d algorithm considered in [4] performs one gradient\nascent step on the Q(· | β) function.\nThe least-squares M-step (4) may also be viewed as an approximation to the standard M-step (5), as\nwe observe numerically (see Appendix H.3) that the LS-EM update β^+ satisfies\n\nQ(β^+ | β) > Q(β | β)  if β ≠ β*.    (6)\n\nThis implies that the least-squares M-step finds an improved solution β^+ (compared to the previous\niterate β) of the function Q(· | β).\n\n4 Analysis of Least Squares EM\n\nIn this section, we analyze the convergence behavior of the LS-EM update (4) in the population\nsetting. We first consider the one dimensional case (d = 1) in Section 4.1 and establish the global\nconvergence of LS-EM, extending the techniques in [10] for 2GMM to log-concave mixtures. In\nSection 4.2, we prove global convergence in the multi-dimensional case (d > 1). 
In this setting,\nthe LS-EM update is not contractive in ℓ_2, so the analysis requires the new ingredient of an angle\ndecreasing property.\n\nFor convenience, we introduce the shorthand F_{β,σ}(X) := g(‖X + β‖_2 / σ) - g(‖X - β‖_2 / σ);\nwhen σ = 1, we simply write F_β ≡ F_{β,1}. Since the integrand in (4) is an even function of X, the\nupdate (4) can be simplified to an equivalent form by integrating over one component of the mixture:\n\nβ^+ = M(β*, β) = E_{X∼f_{β*,σ}} X tanh(0.5 F_{β,σ}(X)).    (7)\n\nThroughout the section, we refer to the technical conditions permitting the interchange of differentiation\nand integration as the regularity condition. This condition is usually satisfied by log-concave\ndistributions; a detailed discussion is provided in Appendix E.\n\n4.1 One Dimensional Case (d = 1)\nFor one dimensional log-concave mixtures, the behavior of LS-EM is similar to that of the EM algorithm\nfor 2GMM: there exist only 3 fixed points, 0, β* and -β*, among which 0 is non-attractive. Consequently,\nLS-EM converges to the true parameter (β* or -β*) from any non-zero initial solution β_0.\nThis is established in the following theorem.\nTheorem 4.1 (Global Convergence, 1D). Suppose that f ∈ F satisfies the regularity condition. The\nLS-EM update (4), β ↦ M(β*, β), has exactly three fixed points: 0, β* and -β*. Moreover, the\nfollowing one-step bound holds:\n\n|M(β*, β) - sign(β β*) β*| ≤ κ(β*, β, σ) · |β - sign(β β*) β*|,\n\nwhere the contraction factor\n\nκ(β*, β, σ) := E_{X∼f_{min(|β|,|β*|),σ}} [ 1 - tanh(0.5 F_{min(|β|,|β*|),σ}(X)) ]\n\nsatisfies 0 < κ(β*, β, σ) < 1 when β ∉ {0, β*, -β*}.\nWe prove this theorem in Appendix C.1. The crucial property used in the proof is the self-consistency\nof the LS-EM update (4), namely M(β, β) = β for all β. 
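
The self-consistency property can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions rather than part of the paper's proof: it takes a unit-variance Laplace base density with g(t) = sqrt(2) * t, approximates the expectation in (7) with a trapezoid rule on a truncated grid, and uses hypothetical helper names (laplace_pdf, laplace_g, population_update) and hypothetical grid settings (half_width, step).

```python
import math


def laplace_pdf(x, loc, sigma=1.0):
    # Assumed base density: unit-variance Laplace, f(u) = exp(-sqrt(2) |u|) / sqrt(2),
    # shifted by loc and scaled by sigma.
    u = (x - loc) / sigma
    return math.exp(-math.sqrt(2.0) * abs(u)) / (math.sqrt(2.0) * sigma)


def laplace_g(t):
    # Negative log-density of the base Laplace density, up to an additive constant.
    return math.sqrt(2.0) * t


def population_update(beta_star, beta, sigma=1.0, half_width=40.0, step=1e-3):
    # Trapezoid-rule approximation of E[X tanh(0.5 F(X))] with X drawn from the
    # single component centred at beta_star, as in (7).
    lo = beta_star - half_width * sigma
    n = int(round(2.0 * half_width * sigma / step))
    total = 0.0
    for k in range(n + 1):
        x = lo + k * step
        f_val = laplace_g(abs(x + beta) / sigma) - laplace_g(abs(x - beta) / sigma)
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * x * math.tanh(0.5 * f_val) * laplace_pdf(x, beta_star, sigma)
    return total * step


for beta in (0.3, 1.0, 2.5):
    # Each printed pair should agree up to the quadrature error.
    print(beta, population_update(beta, beta))
```

Calling population_update(1.0, 0.5) gives a single step from beta = 0.5 toward beta* = 1, which can be compared against the one-step bound in Theorem 4.1.
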
This property allows us to extend the\nsensitivity analysis technique for 2GMM to general log-concave distributions.\nIt can be further shown that the contraction factor κ(β*, β, σ) becomes smaller as the iterate moves\ncloser to the true β* (see Lemma C.2). We thus obtain the following corollary on global convergence\nat a geometric rate. Without loss of generality, assume β* > 0.\nCorollary 4.2 (t-step Convergence Rate, 1D). Suppose that f ∈ F satisfies the regularity condition.\nLet β_t denote the output of LS-EM after t iterations, starting from β_0 ≠ 0. The following holds:\n\n|β_t - sign(β_0 β*) β*| ≤ κ(β*, β_0, σ)^t · |β_0 - sign(β_0 β*) β*|.\n\nIf β_0 is in (0, 0.5β*) or (1.5β*, ∞), running LS-EM for O( log(0.5β* / |β_0 - β*|) / log κ(β*, β_0, σ) ) iterations\noutputs a solution in (0.5β*, 1.5β*). In addition, if β_0 is in (0.5β*, 1.5β*), running LS-EM for\nO(C_f(η) log(1/ε)) iterations outputs an ε-close estimate of β*, where C_f(η) > 0 is a constant\ndepending only on f and the SNR η := β*/σ.\n\nSpecial cases We provide explicit convergence rates for mixtures of some common log-concave\ndistributions. In the following, we assume β* > 0 and β ≥ 0 without loss of generality, and set\nz := min(β, β*).\n• Gaussian: κ(β*, β, σ) ≤ exp(-z^2 / (2σ^2)) and C_f(η) = max(1, 1/η^2).\n• Laplace: κ(β*, β, σ) ≤ 2 exp(-√2 z/σ) / (1 + exp(-2√2 z/σ)) and C_f(η) = max(1, 1/η).\n• Logistic: κ(β*, β, σ) ≤ 4 exp(-πz/(√3 σ)) / (1 + exp(-2πz/(√3 σ)) + 2 exp(-πz/(√3 σ))) and C_f(η) = max(1, 1/η).\n\nSee Appendix C.2 for the proofs of the above results. Note that the convergence rate depends on\nthe signal-to-noise ratio η as well as the asymptotic growth rate γ ≡ γ_f of the log-density function\ng = -log f. 
In the above examples, κ(β*, β, σ) ≈ exp(-c (min(β*, β)/σ)^γ), where γ = 1 for\nLaplace and Logistic, and γ = 2 for Gaussian.\n\n4.2 High Dimensional Case (d > 1)\n\nExtension to higher dimensions is more challenging for log-concave mixtures than for Gaussian\nmixtures. Unlike Gaussian, a log-concave distribution with diagonal covariance need not have\nindependent coordinates. A more severe challenge arises due to the fact that LS-EM is not contractive\nin ℓ_2 distance for general log-concave mixtures. This phenomenon, proved in the lemma below,\nstands in sharp contrast to the Gaussian mixture setting.\nLemma 4.3 (Non-contraction in ℓ_2). Consider a log-concave density of the form g(x) ∝ ‖x‖_2^r with\nr ≥ 1. When r ∈ [1, 2], 0 is the only fixed point of LS-EM in the direction orthogonal to β*. When\nr ∈ (2, ∞), there exists a fixed point other than 0 in the orthogonal direction. Consequently, when\nr > 2, there exists β such that ‖M(β*, β) - β*‖_2 > ‖β - β*‖_2.\nWe prove Lemma 4.3 in Appendix D.3. The lemma shows that it is fundamentally impossible to\nprove global convergence of LS-EM solely based on ℓ_2 distance, which was the approach taken\nin [10] for Gaussian mixtures.\nDespite the above challenges, we show affirmatively that LS-EM converges globally to ±β* for\nmixtures of rotation-invariant log-concave distributions, as long as the initial iterate is not orthogonal\nto β* (a measure zero set).\nAs the first step, we use rotation invariance to show that the LS-EM iterates stay in a two-dimensional\nspace. This is done in the following lemma, which is proved in Appendix D.1.\nLemma 4.4 (LS-EM is 2-Dimensional). The LS-EM update satisfies: M(β*, β) ∈ span(β, β*).\nMoreover, if ∠(β, β*) = 0 or ∠(β, β*) = π/2, then M(β*, β) ∈ span(β).\nWe now establish the asymptotic global convergence property of LS-EM.\nTheorem 4.5 (Global Convergence, d-Dimensional). 
Suppose that f ∈ F satisfies the regularity\ncondition. The LS-EM algorithm converges to sign(⟨β_0, β*⟩) β* from any randomly initialized point\nβ_0 that is not orthogonal to β*.\n\nThe theorem is proved using a novel sensitivity analysis that shows decrease in angle rather than in ℓ_2\ndistance. The proof does not depend on the explicit form of the density, but only log-concavity and\nrotation invariance. We sketch the main ideas of proof below, deferring the details to Appendix D.2.\n\nProof Sketch. Let β_0 be the initial point that is not orthogonal to β*. Without loss of generality, we\nassume ⟨β_0, β*⟩ > 0. Consequently, all the future iterates satisfy ⟨β_t, β*⟩ > 0 (see Lemma D.4).\nIf β_0 is in the span of β* (i.e., β_0 parallels β*), Lemma D.3 ensures that the iterates remain in the\ndirection of β* and converge to β*. On the other hand, if β_0 is not in the span of β*, we make use of\nthe following two key properties of the LS-EM update β^+ = M(β*, β):\n1. Angle Decreasing Property (Lemma D.2): Whenever ∠(β, β*) ∈ (0, π/2), the LS-EM update strictly\ndecreases the iterate\u2019s angle toward β*, i.e., ∠(β^+, β*) < ∠(β, β*);\n2. Local Contraction Region (Corollary D.7): there is a local region around β* such that if any\niterate falls in that region, all the future iterates remain in that region.\n\nSince the sequence of LS-EM iterates is bounded, it must have accumulation points. Using the angle\ndecreasing property and the continuity of M(β*, β) in the second variable β, we show that all the\naccumulation points must be in the direction of β*. In view of the dynamics of the 1-dimensional\ncase (Theorem 4.1), we can further show that the set of accumulation points must fall into one of\nthe following three possibilities: {0}, {β*}, or {0, β*}. 
Below we argue that {0} and {0, β*} are\nimpossible by contradiction.\n• If {0} is the set of accumulation points, the sequence of non-zero iterates β_t would converge\nto 0 and stay in a neighborhood of 0 after some time T; in this case, Lemma D.8 states that the\nnorm of the iterates is bounded away from zero in the limit and hence they cannot converge to 0.\n• If {0, β*} is the set of accumulation points, then there is at least one iterate in the local region\nof β*; by the local contraction region property above, all the future iterates remain close to β*.\nTherefore, 0 cannot be another accumulation point.\n\nWe conclude that β* is the only accumulation point, which LS-EM must converge to.\n\n5 Finite Sample Analysis\n\nIn this section, we consider the finite sample setting in which we are given n data points X^i sampled\ni.i.d. from D(β*, σ). Using the equivalent expression (7) for the population LS-EM update, and\nreplacing the expectation with the sample average, we obtain the finite-sample LS-EM update (see footnote 3):\n\nβ̃^+ = (1/n) Σ_{i=1}^n X^i tanh(0.5 F_{β,σ}(X^i)),  where X^i are i.i.d. ∼ f_{β*,σ}.    (8)\n\nOne approach to extend the population results (in Section 4) to this case is by coupling the population\nupdate β^+ with the finite-sample update β̃^+. To this end, we make use of the fact that\nlog-concave distributions are automatically sub-exponential (see Lemma F.2), so the random variables\n{X^i_j tanh(0.5 F_{β,σ}(X^i))}_{i=1}^n are i.i.d. sub-exponential for each coordinate j. Therefore, the\nconcentration bound\n\n‖β̃^+ - β^+‖_2 = Õ( √( (‖β*‖_2^2 + σ^2) d/n ) )    (9)\n\nholds, and we expect that the convergence properties of the population LS-EM carry over to the\nfinite-sample case, modulo a statistical error of Õ(√(d/n)).\n\nThe above argument is made precise in the following proposition for the one-dimensional case, which is\nproved in Appendix F.1.\nProposition 5.1 (1-d Finite Sample). 
Suppose the density function f ∈ F satisfies the regularity\ncondition. With β ∈ R being the current estimate, the finite-sample LS-EM update (8) satisfies the\nfollowing bound with probability at least 1 - δ:\n\n|β̃^+ - β*| ≤ κ(β*, β, σ) · |β - β*| + (β* + C_f σ) · O( √( (1/n) log(1/δ) ) ),\n\nwhere κ(β*, β, σ) is the contraction factor defined in Theorem 4.1 and C_f is the Orlicz ψ_1 norm (i.e.,\nthe sub-exponential parameter) of a random variable with density f ∈ F.\nUsing Proposition 5.1, we further deduce the global convergence of LS-EM in the finite sample\ncase, which parallels the population result in Corollary 4.2. We present this result assuming sample\nsplitting, i.e., each iteration uses a fresh, independent set of samples. This assumption is standard\nin finite-sample analysis of EM [4, 29, 10, 28, 17, 16]. In this setting, we establish the following\nquantitative convergence guarantee for LS-EM initialized at any non-zero β_0. Without loss of\ngenerality, we assume β_0, β* > 0.\nThe convergence has two stages. In the first stage, the LS-EM iterates enter a local neighborhood\naround β*, regardless of whether β_0 is close to or far from 0. This is the content of the result below.\nProposition 5.2 (First Stage: Escape from 0 and ∞). Suppose the initial point β_0 is in (0, 0.5β*) ∪\n(1.5β*, ∞). After T = O( log(0.25β* / |β_0 - β*|) / log κ(β*, min(β_0, 0.5β*), σ) ) iterations, with\nN/T = Ω( (1 + C_f/η)^2 / (1 - κ(β*, min(β_0, 0.5β*), σ))^2 · log(1/δ) ) fresh samples per iteration,\nLS-EM outputs a solution β̃_T ∈ (0.5β*, 1.5β*) with probability at least\n1 - δ · O( log(0.25β* / |β_0 - β*|) / log κ(β*, min(β_0, 0.5β*), σ) ).\n\nWithin this local neighborhood, the LS-EM iterates converge to β* geometrically, up to a statistical\nerror determined by the sample size. 
This second stage convergence result is given below.\nProposition 5.3 (Second Stage: Local Convergence). The following holds for any ε > 0. Suppose\nβ_0 ∈ (0.5β*, 1.5β*). After T = O( log ε / log κ(β*, 0.5β*, σ) ) iterations, with\nN/T = Ω( (β* + C_f/η)^2 / (ε^2 (1 - κ(β*, 0.5β*, σ))^2) · log(1/δ) ) fresh samples per iteration,\nLS-EM outputs a solution β̃_T satisfying |β̃_T - β*| ≤ εβ* with probability at least\n1 - δ · O( log ε / log κ(β*, 0.5β*, σ) ).\nWe prove Propositions 5.2 and 5.3 in Appendix F.2.\nLet us parse the above results in the special cases of Gaussian, Laplace and Logistic, assuming\nthat σ = 1 for simplicity. In Section 4.1 we showed that κ(β*, β, σ) = exp(-min(β, β*)^γ),\nwhere γ ≡ γ_f is the growth rate of the log density -log f. Consequently, the first stage requires\nO(1/min(β_0, β*)^γ) iterations with Ω̃(1/min(β_0, β*)^{2γ}) samples per iteration, and the second\nstage requires O(log(1/ε)/η^γ) iterations with Ω̃(1/(ε^2 η^{2γ})) samples per iteration, where η := β*/σ\nis the SNR. As can be seen, we have better iteration and sample complexities with a larger η ≥ 1\n(larger separation between the components) and a larger γ (lighter tail of the components).\n\nFootnote 3: This expression is for analytic purposes only. To actually implement LS-EM, we use samples X^i from the\nmixture distribution D(β*, σ), which is equivalent to (8).\n\nIn contrast, in the low SNR regime with η < 1, the sample complexity actually becomes worse for a\nlarger γ (lighter tails). Indeed, low SNR means that two components are close in location when σ = 1.\nIf their tails are lighter, then it becomes more likely that the mixture density (f_{β*,σ} + f_{-β*,σ})/2 has\na unique mode at 0 instead of two modes at ±β*. 
In this case, the mixture problem becomes harder\nas it is more difficult to distinguish between the two components.\nIn the higher dimensional setting, we can similarly show coupling in ℓ_2 (i.e., bounding ‖β̃^+ - β^+‖_2)\nvia sub-exponential concentration. However, extending the convergence results above to d > 1 is\nmore subtle, due to the issue of ℓ_2 non-contraction (see Lemma 4.3). Addressing this issue would\nrequire coupling in a different metric (e.g., in angle; see [17, 28]); we leave this to future work.\n\n6 Robustness Under Model Mis-specification\n\nIn practice, it is sometimes difficult to know a priori the exact parametric form of a log-concave\ndistribution that generates the data. This motivates us to consider the following scenario: the data\nis from the mixture D(β*, σ) in (2) with a true log-concave distribution f ∈ F and unknown\nlocation parameter β*, but we run LS-EM assuming some other log-concave distribution\nf̂(·) = C_ĝ^{-1} exp(-ĝ(‖·‖_2)) ∈ F. Using the same symmetry argument as in deriving (7), we obtain the\nfollowing expression for the mis-specified LS-EM update in the population case:\n\nβ̂^+ = M̂(β*, β) := E_{X∼f_{β*,σ}} X tanh(0.5 F̂_{β,σ}(X)),    (10)\n\nwhere F̂_{β,σ}(X) := ĝ(‖X + β‖_2 / σ) - ĝ(‖X - β‖_2 / σ).\n\nMany properties of the LS-EM update are preserved in this mis-specification setting. In particular,\nusing the same approach as in Lemma 4.4 and Lemma D.2, we can show that the mis-specified\nLS-EM update is also a two dimensional object and satisfies the same strict angle decreasing property\n∠(β̂^+, β*) < ∠(β, β*). Therefore, to study the convergence behavior of mis-specified LS-EM, it\nsuffices to understand the one-dimensional case (i.e., along the β* direction).\n\nWe provide such a result focusing on the setting in which f̂ is Gaussian, that is, we fit a Gaussian\nmixture to a true mixture of log-concave distributions. In this setting, we can show that mis-specified\nLS-EM has only 3 fixed points {±β̄, 0} (Lemma G.1). Moreover, we can bound the distance between\nβ̄ and the true β*, thereby establishing the following convergence result:\nProposition 6.1 (Fit with 2GMM). Under the above one dimensional setting with Gaussian f̂, the\nfollowing holds for some absolute constant C_0 > 0: If η ≥ C_0, then the LS-EM algorithm with a\nnon-zero initialization point β_0 converges to a solution β̄ satisfying sign(β̄) = sign(β_0) and\n\n|β̄ - sign(β_0 β*) β*| ≤ 10σ.\n\nWe prove this proposition in Appendix G.1. The proposition establishes a robustness property of LS-EM:\neven in the mis-specified setting, LS-EM still converges globally. Moreover, when the SNR η is high\n(i.e., small noise level σ), the final estimation error is small and scales linearly with σ.\n\n7 Discussion\n\nIn this paper, we have established the global convergence of the Least Squares EM algorithm for a\nmixture of two log-concave densities. Our theoretical results are proved under the following three\nassumptions on the densities: (i) rotation invariance, (ii) log-concavity, and (iii) monotonicity\nof the log-density function g with respect to the norm; cf. Equation (1). As we discuss in greater detail in\nAppendix H, all these assumptions appear to be essential under our current framework of analysis.\nMoreover, empirical results suggest that the log-concavity assumption cannot be relaxed completely:\nFigure 1 provides an example where the LS-EM algorithm may converge to 0 (an undesired solution)\nwith constant probability when the log-concavity property is violated. 
See Appendix H for additional\nnumerical results.\n\nFigure 1: The base distribution class f is proportional to exp(-|x|^{0.25}). This is log-convex instead\nof log-concave. The ground truth location parameter is set to be 1. We initialize the LS-EM iterates\nat β = 0.1, 0.2, 0.3, 0.4, 0.5, and it is seen that all converge to 0 after some iterations.\n\nFor future work, an immediate direction is to establish quantitative global convergence guarantees in\nhigh dimensions for both the population and finite sample cases; doing so would require generalizing\nthe angle convergence property in [17] to log-concave distributions. In light of the discussion above,\nit is also of interest to investigate to what extent the rotation invariance assumption can be relaxed, as\nmany interesting log-concave distributions are skewed. It is also interesting to consider mixtures with\nmultiple components.\n\nAcknowledgement\nW. Qian and Y. Chen are partially supported by NSF grants CCF-1657420 and CCF-1704828.\n", "award": [], "sourceid": 2674, "authors": [{"given_name": "Wei", "family_name": "Qian", "institution": "Cornell University"}, {"given_name": "Yuqian", "family_name": "Zhang", "institution": "Cornell University"}, {"given_name": "Yudong", "family_name": "Chen", "institution": "Cornell University"}]}