{"title": "Near-Optimal-Sample Estimators for Spherical Gaussian Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 1395, "page_last": 1403, "abstract": "Many important distributions are high dimensional, and often they can be modeled as Gaussian mixtures. We derive the first sample-efficient polynomial-time estimator for high-dimensional spherical Gaussian mixtures. Based on intuitive spectral reasoning, it approximates mixtures of $k$ spherical Gaussians in $d$-dimensions to within$\\ell_1$ distance $\\epsilon$ using $\\mathcal{O}({dk^9(\\log^2 d)}/{\\epsilon^4})$ samples and $\\mathcal{O}_{k,\\epsilon}(d^3\\log^5 d)$ computation time. Conversely, we show that any estimator requires $\\Omega\\bigl({dk}/{\\epsilon^2}\\bigr)$ samples, hence the algorithm's sample complexity is nearly optimal in the dimension. The implied time-complexity factor \\mathcal{O}_{k,\\epsilon}$ is exponential in $k$, but much smaller than previously known. We also construct a simple estimator for one-dimensional Gaussian mixtures that uses $\\tilde\\mathcal{O}(k /\\epsilon^2)$ samples and $\\tilde\\mathcal{O}((k/\\epsilon)^{3k+1})$ computation time.", "full_text": "Near-Optimal-Sample Estimators for Spherical\n\nGaussian Mixtures\n\nJayadev Acharya\u2217\n\nMIT\n\njayadev@mit.edu\n\nAshkan Jafarpour, Alon Orlitsky, Ananda Theertha Suresh\n\n{ashkan, alon, asuresh}@ucsd.edu\n\nUC San Diego\n\nAbstract\n\nMany important distributions are high dimensional, and often they can be modeled\nas Gaussian mixtures. We derive the \ufb01rst sample-ef\ufb01cient polynomial-time esti-\nmator for high-dimensional spherical Gaussian mixtures. Based on intuitive spec-\ntral reasoning, it approximates mixtures of k spherical Gaussians in d-dimensions\n\nto within (cid:96)1 distance \u0001 using O(dk9(log2 d)~\u00014) samples and Ok,\u0001(d3 log5 d)\ncomputation time. Conversely, we show that any estimator requires \u2126\u0001dk~\u00012\u0001\nmension. 
The implied time-complexity factorOk,\u0001 is exponential in k, but much\nuses \u0303O(k~\u00012) samples and \u0303O((k~\u0001)3k+1) computation time.\n\nsmaller than previously known.\nWe also construct a simple estimator for one-dimensional Gaussian mixtures that\n\nsamples, hence the algorithm\u2019s sample complexity is nearly optimal in the di-\n\n1\n\nIntroduction\n\n1.1 Background\n\nMeaningful information often resides in high-dimensional spaces: voice signals are expressed in\nmany frequency bands, credit ratings are in\ufb02uenced by multiple parameters, and document topics\nare manifested in the prevalence of numerous words. Some applications, such as topic modeling\nand genomic analysis consider data in over 1000 dimensions [31, 14]. Typically, information can\nbe generated by different types of sources: voice is spoken by men or women, credit parameters\ncorrespond to wealthy or poor individuals, and documents address topics such as sports or politics.\nIn such cases the overall data follow a mixture distribution [26, 27]. Mixtures of high-dimensional\ndistributions are therefore central to the understanding and processing of many natural phenomena.\nMethods for recovering the mixture components from the data have consequently been extensively\nstudied by statisticians, engineers, and computer scientists.\nInitially, heuristic methods such as expectation-maximization were developed [25, 21]. Over the\npast decade, rigorous algorithms were derived to recover mixtures of d-dimensional spherical Gaus-\nsians [10, 18, 4, 8, 29] and general Gaussians [9, 2, 5, 19, 22, 3]. Many of these algorithms consider\n\nmixtures where the (cid:96)1 distance between the mixture components is 2\u2212 od(1), namely approaches\n\nthe maximum of 2 as d increases. They identify the distribution components in time and samples\nthat grow polynomially in d. 
Recently, [5, 19, 22] showed that the parameters of any k-component d-dimensional Gaussian mixture can be recovered in time and samples that grow as a high-degree polynomial in d and exponentially in k.

A different approach that avoids the large component-distance requirement and the high time and sample complexity considers a slightly relaxed notion of approximation, sometimes called PAC learning [20], or proper learning, that does not approximate each mixture component, but instead derives a mixture distribution that is close to the original one. Specifically, given a distance bound ε > 0, error probability δ > 0, and samples from the underlying mixture f, where we use boldface letters for d-dimensional objects, PAC learning seeks a mixture estimate f̂ with at most k components such that D(f, f̂) ≤ ε with probability ≥ 1 − δ, where D(·,·) is some given distance measure, for example ℓ1 distance or KL divergence.

An important and extensively studied special case of Gaussian mixtures is the mixture of spherical Gaussians [10, 18, 4, 8, 29], where for each component the d coordinates are distributed independently with the same variance, though possibly with different means. Note that different components can have different variances. Due to their simple structure, spherical-Gaussian mixtures are easier to analyze and, under a minimum-separation assumption, have provably-practical algorithms for clustering and parameter estimation. We consider spherical-Gaussian mixtures as they are important on their own and form a natural first step towards learning general Gaussian mixtures.

*Author was a student at UC San Diego at the time of this work.

1.2 Sample complexity

Reducing the number of samples required for learning is of great practical significance.
For example, in topic modeling every sample is a whole document, in credit analysis every sample is a person's credit history, and in genetics every sample is a human DNA sequence. Hence samples can be very scarce and obtaining them can be very costly. By contrast, current CPUs run at several gigahertz, hence samples are typically a much scarcer resource than time.

For one-dimensional distributions, the need for sample-efficient algorithms has been broadly recognized. The sample complexity of many problems is known quite accurately, often to within a constant factor. For example, for discrete distributions over {1, ..., s}, an approach was proposed in [23] and its modifications were used in [28] to estimate the probability multiset using Θ(s/log s) samples. Learning one-dimensional m-modal distributions over {1, ..., s} requires Θ(m log(s/m)/ε³) samples [11]. Similarly, one-dimensional mixtures of k structured distributions (log-concave, monotone hazard rate, and unimodal) over {1, ..., s} can be learned with O(k/ε⁴), O(k log(s/ε)/ε⁴), and O(k log(s)/ε⁴) samples, respectively, and these bounds are tight up to a factor of ε [6].

Unlike the one-dimensional case, in high dimensions sample complexity bounds are quite weak. For example, to learn a mixture of k = 2 spherical Gaussians, existing estimators use O(d¹²) samples, and this number increases exponentially with k [16]. We close this gap by constructing estimators with near-linear sample complexity.

1.3 Previous and new results

Our main contribution is PAC learning d-dimensional spherical Gaussian mixtures with near-linear samples.
In the process of deriving these results we also prove results for learning one-dimensional Gaussians and for finding which distribution in a class is closest to the one generating samples.

d-dimensional Gaussian mixtures

Several papers considered PAC learning of discrete- and Gaussian-product mixtures. [17] considered mixtures of two d-dimensional Bernoulli products where all probabilities are bounded away from 0. They showed that this class is PAC learnable in Õ(d²/ε⁴) time and samples, where the Õ notation hides logarithmic factors. [15] eliminated the probability constraints and generalized the results from binary to arbitrary discrete alphabets and from 2 to k mixture components, showing that these mixtures are PAC learnable in Õ((d/ε)^{2k²(k+1)}) time. Although they did not explicitly mention sample complexity, their algorithm uses Õ((d/ε)^{4(k+1)}) samples. [16] generalized these results to Gaussian products and showed that mixtures of k Gaussians, where the difference between the means is bounded by B times the standard deviation, are PAC learnable in Õ((dB/ε)^{2k²(k+1)}) time, and can be shown to use Õ((dB/ε)^{4(k+1)}) samples. These algorithms consider the KL divergence between the distribution and its estimate, but it can be shown that the ℓ1 distance would result in similar complexities. It can also be shown that these algorithms or their simple modifications have similar time and sample complexities for spherical Gaussians as well.

Our main contribution for this problem is to provide an algorithm that PAC learns mixtures of spherical Gaussians in ℓ1 distance with a number of samples nearly linear, and running time polynomial, in the dimension d. Specifically, in Theorem 11 we show that mixtures of k spherical-Gaussian distributions can be learned using

n = O((dk⁹/ε⁴) log²(d/δ)) = O_{k,ε}(d log²(d/δ))

samples and in time

O(n²d log n + d((k⁷/ε³) log²(d/δ))^{k²/2}) = Õ_{k,ε}(d³).

Recall that for similar problems, previous algorithms used Õ((d/ε)^{4(k+1)}) samples. Furthermore, recent algorithms typically construct the covariance matrix [29, 16], hence require ≥ nd² time. In that sense, for small k, the time complexity we derive is comparable to the best such algorithms one can hope for. Additionally, the exponential dependence on k in the time complexity is d((k⁷/ε³) log²(d/δ))^{k²/2}, significantly lower than the d^{O(k³)} dependence in previous results.

Conversely, Theorem 2 shows that any algorithm for PAC learning a mixture of k spherical Gaussians requires Ω(dk/ε²) samples, hence our algorithms are nearly sample optimal in the dimension.

One-dimensional Gaussian mixtures

To prove the above results we derive two simpler results that are interesting on their own. We construct a simple estimator that learns mixtures of k one-dimensional Gaussians using Õ(kε⁻²) samples and in time Õ((k/ε)^{3k+1}). In addition, their time complexity significantly improves on previously known ones. We note that independently and concurrently with this work, [12] showed that mixtures of two one-dimensional Gaussians can be learned with Õ(ε⁻²) samples and in time O(ε⁻⁵). Combining with some of the techniques in this paper, they extend their algorithm to mixtures of k Gaussians, and reduce the exponent to 3k − 1.

Let d(f, F) be the smallest ℓ1 distance between a distribution f and any distribution in a collection F.
The popular SCHEFFE estimator [13] takes a surprisingly small number, O(log |F|), of independent samples from an unknown distribution f and time O(|F|²) to find a distribution in F whose distance from f is at most a constant factor larger than d(f, F). In Lemma 1, we reduce the time complexity of the Scheffe algorithm from O(|F|²) to Õ(|F|), helping us reduce the running time of our algorithms. A detailed analysis of several such estimators is provided in [1], and here we outline a proof for one particular estimator for completeness.

1.4 The approach and technical contributions

Given the above, our goal is to construct a small class of distributions such that one of them is ε-close to the underlying distribution.

Consider for example mixtures of k components in one dimension with means and variances bounded by B. Take the collection of all mixtures derived by quantizing the means and variances of all components to ε_m accuracy, and quantizing the weights to ε_w accuracy. It can be shown that if ε_m, ε_w ≤ ε/k², then one of these candidate mixtures would be O(ε)-close to any mixture, and hence to the underlying one. There are at most (B/ε_m)^{2k} · (1/ε_w)^k = (B/ε)^{Õ(k)} candidates, and running SCHEFFE on these mixtures would lead to an estimate. However, this approach requires a bound on the means and variances. We remove this requirement on the bound by selecting the quantizations based on samples, as described in Section 3.

In d dimensions, consider spherical Gaussians with the same variance and means bounded by B. Again, take the collection of all distributions derived by quantizing the means of all components in all coordinates to ε_m accuracy, and quantizing the weights to ε_w accuracy. It can be shown that for d-dimensional Gaussians to get distance ε from the underlying distribution, it suffices to take ε_m, ε_w ≤ ε²/poly(dk).
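The quantization that defines these bounded candidate classes is mechanical; as a minimal illustration (the function name, the argument conventions, and the symmetric mean range [−B, B] are our assumptions, not the paper's), the parameter grids can be built as:

```python
import numpy as np

def quantized_candidates_1d(B, eps_m, eps_w):
    """Parameter grids for candidate k-component 1-D mixtures with means
    and variances bounded by B: means and variances quantized to eps_m
    accuracy, weights quantized to eps_w accuracy.  A candidate mixture
    picks k (mean, variance) pairs and k-1 weights from these grids."""
    means = np.arange(-B, B + eps_m, eps_m)          # ~2B/eps_m points
    variances = np.arange(eps_m, B + eps_m, eps_m)   # positive variances only
    weights = np.arange(0.0, 1.0 + eps_w, eps_w)     # {0, eps_w, ..., 1}
    return means, variances, weights
```

With ε_m = ε_w = ε/k², the resulting number of candidate mixtures matches the (B/ε)^{Õ(k)} count quoted above; the sample-based construction of Section 3 removes the dependence on the bound B.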
There are at most (B/ε_m)^{dk} · (1/ε_w)^k = 2^{Õ_ε(dk)} possible combinations of the k mean vectors and weights. Hence SCHEFFE implies an exponential-time algorithm with sample complexity Õ(dk). To reduce the dependence on d, one can approximate the span of the k mean vectors. This reduces the problem from d to k dimensions, allowing us to consider a distribution collection of size 2^{O(k²)}, with SCHEFFE sample complexity of just O(k²). [15, 16] construct the sample correlation matrix and use k of its columns to approximate the span of mean vectors. This approach requires the k columns of the sample correlation matrix to be very close to the actual correlation matrix, requiring many more samples.

We derive a spectral algorithm that approximates the span of the k mean vectors using the top k eigenvectors of the sample covariance matrix. Since we use the entire covariance matrix instead of just k columns, a weaker concentration suffices and the sample complexity can be reduced. Using recent tools from non-asymptotic random matrix theory [30], we show that the span of the means can be approximated with Õ(d) samples. This result allows us to address most “reasonable” distributions, but there are still some “corner cases” that need to be analyzed separately. To address them, we modify some known clustering algorithms such as single-linkage and spectral projections. While the basic algorithms were known before, our contribution here, which takes a fair bit of effort and space, is to show that judicious modifications of the algorithms and rigorous statistical analysis yield polynomial-time algorithms with near-linear sample complexity. We provide a simple and practical spectral algorithm that estimates all such mixtures with O_{k,ε}(d log² d) samples.

The paper is organized as follows.
In Section 2, we introduce notation, describe results on the Scheffe estimator, and state a lower bound. In Sections 3 and 4, we present the algorithms for one-dimensional and d-dimensional Gaussian mixtures, respectively. Due to space constraints, most of the technical details and proofs are given in the appendix.

2 Preliminaries

2.1 Notation

For arbitrary product distributions p1, ..., pk over a d-dimensional space, let p_{j,i} be the distribution of p_j over coordinate i, and let µ_{j,i} and σ_{j,i} be the mean and variance of p_{j,i}, respectively. Let f = (w1, ..., wk, p1, ..., pk) be the mixture of these distributions with mixing weights w1, ..., wk. We denote estimates of a quantity x by x̂; an estimate can be an empirical mean or a more complex estimate. ‖·‖ denotes the spectral norm of a matrix and ‖·‖₂ is the ℓ2 norm of a vector. We use D(·,·) to denote the ℓ1 distance between two distributions.

2.2 Selection from a pool of distributions

Many algorithms for learning mixtures over the domain X first obtain a small collection F of mixtures and then perform a maximum-likelihood test using the samples to output a distribution [15, 17]. Our algorithm also obtains a set of distributions containing at least one that is close to the underlying one in ℓ1 distance. The estimation problem now reduces to the following: given a class F of distributions and samples from an unknown distribution f, find a distribution in F that is close to f. Let D(f, F) := min_{fi∈F} D(f, fi).

The well-known Scheffe method [13] uses O(ε⁻² log |F|) samples from the underlying distribution f, and in time O(ε⁻²|F|²T log |F|) outputs a distribution in F with ℓ1 distance of at most 9.1 · max(D(f, F), ε) from f, where T is the time required to compute the probability of an x ∈ X under a distribution in F.
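The pairwise test behind Scheffe-style selection is easy to sketch. The following minimal Python illustration (ours, not the paper's MODIFIED SCHEFFE; candidates are (mean, std) pairs of one-dimensional Gaussians, and n_mc is a Monte Carlo sample size we introduce to approximate each candidate's mass of the Scheffe set) runs the classic O(|F|²) round-robin tournament:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated pointwise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def scheffe_winner(c1, c2, data, n_mc=20000, rng=None):
    """Scheffe test between candidates c1 = (mu1, s1) and c2 = (mu2, s2):
    with A = {x : f1(x) > f2(x)}, the winner is the candidate whose
    probability of A is closer to the empirical frequency of A."""
    rng = np.random.default_rng(rng)
    in_A = lambda x: gauss_pdf(x, *c1) > gauss_pdf(x, *c2)
    emp = np.mean(in_A(data))                            # empirical mass of A
    p1 = np.mean(in_A(rng.normal(c1[0], c1[1], n_mc)))   # Monte Carlo F1(A)
    p2 = np.mean(in_A(rng.normal(c2[0], c2[1], n_mc)))   # Monte Carlo F2(A)
    return c1 if abs(p1 - emp) <= abs(p2 - emp) else c2

def scheffe_tournament(candidates, data, rng=None):
    """Quadratic round-robin: every pair plays a Scheffe test and the
    candidate with the most wins is output."""
    wins = [0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            w = scheffe_winner(candidates[i], candidates[j], data, rng=rng)
            wins[i if w == candidates[i] else j] += 1
    return candidates[int(np.argmax(wins))]
```

The near-linear MODIFIED SCHEFFE variant of Lemma 1 below avoids playing all pairs; this sketch only illustrates the pairwise test itself.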
A naive application of this algorithm requires time quadratic in the number of distributions in F. We propose a variant that works in near-linear time. More precisely,

Lemma 1 (Appendix B). Let ε > 0. For some constant c, given (c/ε²) log(|F|/δ) independent samples from a distribution f, with probability ≥ 1 − δ, the output f̂ of MODIFIED SCHEFFE satisfies D(f̂, f) ≤ 1000 · max(D(f, F), ε). Furthermore, the algorithm runs in time O(|F|T log(|F|/δ)/ε²).

Several such estimators have been proposed in the past [11, 12]. A detailed analysis of the estimator presented here was studied in [1]. We outline a proof in Appendix B for completeness. Note that the constant 1000 in the above lemma has not been optimized. For our problem of estimating k-component mixtures in d dimensions, T = O(dk) and |F| = Õ_{k,ε}(d²).

2.3 Lower bound

Using Fano's inequality, we show an information-theoretic lower bound of Ω(dk/ε²) samples for any algorithm that learns k-component d-dimensional spherical Gaussian mixtures. More precisely,

Theorem 2 (Appendix C). Any algorithm that learns all k-component d-dimensional spherical Gaussian mixtures to ℓ1 distance ε with probability ≥ 1/2 requires Ω(dk/ε²) samples.

3 Mixtures in one dimension

Over the past decade, estimation of one-dimensional distributions has gained significant attention [24, 28, 11, 6, 12, 7]. We provide a simple estimator for learning one-dimensional Gaussian mixtures using the MODIFIED SCHEFFE estimator. Formally, given samples from f, a mixture of Gaussian distributions p_i := N(µi, σi²) with weights w1, w2, ..., wk, our goal is to find a mixture f̂ = (ŵ1, ŵ2, ..., ŵk, p̂1, p̂2, ..., p̂k) such that D(f, f̂) ≤ ε. We make no assumption on the weights, means, or variances of the components. While we do not use the one-dimensional algorithm in the d-dimensional setting, it provides insight into the usage of the MODIFIED SCHEFFE estimator and may be of independent interest. As stated in Section 1.4, our quantizations are based on samples; this is an immediate consequence of the following observation for samples from a Gaussian distribution.

Lemma 3 (Appendix D.1). Given n independent samples x1, ..., xn from N(µ, σ²), with probability ≥ 1 − δ there are two samples xj, xk such that |xj − µ| ≤ σ√(7 log(2/δ)/(2n)) and ||xj − xk| − σ| ≤ 2σ√(7 log(2/δ)/(2n)).

The above lemma states that, given samples from a Gaussian distribution, there is a sample close to the mean and there are two samples that are about a standard deviation apart. Hence, if we consider the set of all Gaussians {N(xj, (xj − xk)²) : 1 ≤ j, k ≤ n}, then that set contains a Gaussian close to the underlying one. The same holds for mixtures, and for a Gaussian mixture we can create the set of candidate mixtures as follows.

Lemma 4 (Appendix D.2). Given n ≥ (120k/ε²) log(4k/δ) samples from a mixture f of k Gaussians, let S = {N(xj, (xj − xk)²) : 1 ≤ j, k ≤ n} and let W = {0, ε/2k, 2ε/2k, ..., 1} be a set of weights. Let

F := {(ŵ1, ŵ2, ..., ŵk, p̂1, p̂2, ..., p̂k) : p̂i ∈ S, ∀1 ≤ i ≤ k − 1, ŵi ∈ W, ŵk = 1 − (ŵ1 + ... + ŵ_{k−1}) ≥ 0}

be a set of n^{2k}(2k/ε)^{k−1} ≤ n^{3k−1} candidate distributions. There exists f̂ ∈ F such that D(f, f̂) ≤ ε.

Running the MODIFIED SCHEFFE algorithm on the above set of candidates F yields a mixture that is close to the underlying one. By Lemma 1 and the above lemma we obtain

Corollary 5 (Appendix D.3). Let n ≥ c · (k/ε²) log(k/εδ) for some constant c. There is an algorithm that runs in time O((k log(k/εδ)/ε²)^{3k−1} · k² log(k/εδ)/ε²) and returns a mixture f̂ such that D(f, f̂) ≤ 1000ε with probability ≥ 1 − 2δ.

[12] considered the one-dimensional Gaussian mixture problem for two-component mixtures. While the process of identifying the candidate means is the same for both papers, the process of identifying the variances and the proof techniques are different.

4 Mixtures in d dimensions

Algorithm LEARN k-SPHERE learns mixtures of k spherical Gaussians using near-linear samples. For clarity and simplicity of proofs, we first prove the result when all components have the same variance σ², i.e., p_i = N(µi, σ²I_d) for 1 ≤ i ≤ k. A modification of this algorithm works for components with different variances; the core ideas are the same and we discuss the changes in Section 4.3. The algorithm starts out by estimating σ², and we discuss this step later. We estimate the means in three steps: a coarse single-linkage clustering, recursive spectral clustering, and a search over the span of means. We now discuss the necessity of these steps.

4.1 Estimating the span of means

A simple modification of the one-dimensional algorithm can be used to learn mixtures in d dimensions; however, the number of candidate mixtures would be exponential in d, the number of dimensions. As stated in Section 1.4, given the span of the mean vectors µi, we can grid the k-dimensional span to the required accuracy ε_g and use MODIFIED SCHEFFE to obtain a polynomial-time algorithm. One of the natural and well-used methods to estimate the span of mean vectors is using the correlation matrix [29].
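In code, this spectral estimate is a few lines. The sketch below (numpy; the function name is ours, and a known common variance σ² is assumed) forms the correlation-type matrix and returns its top-k eigenvectors as an estimated basis for the span of the means:

```python
import numpy as np

def span_of_means(X, sigma2, k):
    """Estimate the span of the component means from an (n, d) sample
    matrix X.  E[(1/n) X^T X - sigma2 * I] = sum_j w_j mu_j mu_j^T, whose
    top-k eigenvectors span the k component means."""
    n, d = X.shape
    S = (X.T @ X) / n - sigma2 * np.eye(d)
    # np.linalg.eigh returns eigenvalues in ascending order; the last k
    # columns are the eigenvectors of the k largest eigenvalues.
    _, vecs = np.linalg.eigh(S)
    return vecs[:, -k:]  # (d, k) orthonormal basis for the estimated span
```

With well-separated means and enough samples, projecting each true mean onto the returned subspace recovers it almost exactly; the text below explains why the number of samples needed for this convergence can nonetheless grow with the separation.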
Consider the correlation-type matrix

S = (1/n) Σ_{i=1}^{n} X(i)X(i)ᵗ − σ²I_d.

For a sample X from a particular component j, E[XXᵗ] = σ²I_d + µ_jµ_jᵗ, and the expected fraction of samples from p_j is w_j. Hence

E[S] = Σ_{j=1}^{k} w_j µ_j µ_jᵗ.

Therefore, as n → ∞, S converges to Σ_{j=1}^{k} w_j µ_j µ_jᵗ, and its top k eigenvectors span the means.

While the above intuition is well understood, the number of samples necessary for convergence is not well studied. We wish Õ(d) samples to be sufficient for convergence irrespective of the values of the means. However, this is not true when the means are far apart. In the following example we demonstrate that the convergence of averages can depend on their separation.

Example 6. Consider the special case d = 1, k = 2, σ² = 1, w1 = w2 = 1/2, and mean difference µ1 − µ2 = L ≫ 1. Given this prior information, one can estimate the average of the mixture, which yields (µ1 + µ2)/2. Solving the equations obtained from µ1 + µ2 and µ1 − µ2 = L yields µ1 and µ2. The variance of the mixture is 1 + L²/4 > L²/4. With additional Chernoff-type bounds, one can show that given n samples the error in estimating the average is

|µ1 + µ2 − µ̂1 − µ̂2| ≈ Θ(L/√n).

Hence, estimating the means to high precision requires n ≥ L², i.e., the higher the separation, the more samples are necessary if we use the sample mean.

A similar phenomenon happens in the convergence of the correlation matrices, where the variances of quantities of interest increase with separation. In other words, for the span to be accurate, the number of samples necessary increases with the separation. To overcome this, a natural idea is to cluster the Gaussians such that the component means in the same cluster are close, then estimate the span of means, and apply SCHEFFE on the span within each cluster.

For clustering, we use another spectral algorithm. Even though spectral clustering algorithms are studied in [29, 2], they assume that the weights are strictly bounded away from 0, which does not hold here. We use a simple recursive clustering algorithm that takes a cluster C with average µ(C). If there is a component i in the cluster such that √wi ‖µi − µ(C)‖₂ is Ω(√(log(n/δ)) σ), then the algorithm divides the cluster into two nonempty clusters without any mis-clustering. For technical reasons similar to the above example, we first use a coarse clustering algorithm that ensures that the mean separation of any two components within each cluster is Õ(d^{1/4}σ).

Our algorithm thus comprises (i) variance estimation, (ii) a coarse clustering ensuring that means are within Õ(d^{1/4}σ) of each other in each cluster, (iii) a recursive spectral clustering that reduces the mean separation to O(√(k³ log(n/δ)/ε) σ), (iv) estimating the span of means within each cluster, and (v) quantizing the means and running MODIFIED SCHEFFE on the resulting candidate mixtures.

To simplify the bounds and expressions, we assume that d > 1000 and δ ≥ min(2n²e^{−d/10}, 1/3). For smaller values of δ, we run the algorithm with error 1/3 and repeat it O(log(1/δ)) times to choose a set of candidate mixtures F_δ. By the Chernoff bound, with error ≤ δ, F_δ contains a mixture ε-close to f. Finally, we run MODIFIED SCHEFFE on F_δ to obtain a mixture that is close to f. By the union bound and Lemma 1, the error of the new algorithm is ≤ 2δ.
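Steps (i) and (ii) of this pipeline are simple enough to sketch directly. The following Python illustration (ours, not the paper's implementation; quadratic time for clarity, with the merge threshold left as a parameter) estimates σ² from the minimum pairwise distance among k + 1 samples and then runs threshold single-linkage via union-find:

```python
import numpy as np

def estimate_variance(X, k):
    """Step (i): among any k+1 samples, two come from the same component
    (pigeonhole), so the minimum pairwise squared distance concentrates
    near 2*d*sigma^2; divide it out to estimate sigma^2."""
    pts = X[: k + 1]
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)   # ignore zero self-distances
    return d2.min() / (2 * X.shape[1])

def single_linkage(X, threshold2):
    """Step (ii): coarse single-linkage -- merge clusters while some pair
    of points in different clusters has squared distance <= threshold2."""
    n = len(X)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if ((X[i] - X[j]) ** 2).sum() <= threshold2:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

In LEARN k-SPHERE the squared-distance threshold is set to 2dσ̂² + 23σ̂²√(d log(n²/δ)), per step 2 of the algorithm listing.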
4.2 Sketch of correctness

We now describe the steps, stating the performance of each step of Algorithm LEARN k-SPHERE.

Variance estimation: Let σ̂ be the variance estimate from step 1. If X(1) and X(2) are two samples from components i and j respectively, then X(1) − X(2) is distributed N(µi − µj, 2σ²I_d). Hence, for large d, ‖X(1) − X(2)‖₂² concentrates around 2dσ² + ‖µi − µj‖₂². By the pigeonhole principle, given k + 1 samples, two of them are from the same component. Therefore, the minimum pairwise distance between k + 1 samples is close to 2dσ². This is made precise in the next lemma, which states that σ̂² is a good estimate of the variance.

Lemma 7 (Appendix E.1). Given n samples from the k-component mixture, with probability 1 − 2δ, |σ̂² − σ²| ≤ 2.5σ²√(log(n²/δ)/d).

Algorithm LEARN k-SPHERE

Input: n samples x(1), x(2), ..., x(n) from f, and ε.

1. Sample variance: σ̂² = min_{a≠b: a,b∈[k+1]} ‖x(a) − x(b)‖₂²/(2d).

2. Coarse single-linkage clustering: Start with each sample as a cluster.
• While there exist two clusters with squared distance ≤ 2dσ̂² + 23σ̂²√(d log(n²/δ)), merge them.

3. Recursive spectral-clustering: While there is a cluster C with |C| ≥ nε/(5k) and spectral norm of its sample covariance matrix ≥ 12k²σ̂² log(n³/δ):
• Use nε/(8k²) of the samples to find the largest eigenvector and discard these samples.
• Project the remaining samples on the largest eigenvector.
• Perform single-linkage in the projected space (as before) till the distance between clusters is > 3σ̂√(log(n²k/δ)), creating new clusters.

4. Exhaustive search: Let ε_g = ε/(16k^{3/2}), L = 200√(k⁴ε⁻¹ log(n²/δ)), L′ = 32k√(log(n²/δ))/ε, and G = {−L, ..., −ε_g, 0, ε_g, 2ε_g, ..., L}. Let W = {0, ε/(4k), 2ε/(4k), ..., 1} and Σ := {σ′² : σ′² = σ̂²(1 + iε/√(128dk²)), ∀ −L′ < i ≤ L′}.
• For each cluster C find its top k − 1 eigenvectors u1, ..., u_{k−1}. Let Span(C) = {µ̂(C) + Σ_{i=1}^{k−1} g_i σ̂ u_i : g_i ∈ G}.
• Let Span = ∪_{C: |C| ≥ nε/(5k)} Span(C).
• For all w′_i ∈ W, σ′² ∈ Σ, and µ̂_i ∈ Span, add (w′_1, ..., w′_{k−1}, 1 − Σ_{i=1}^{k−1} w′_i, N(µ̂1, σ′²), ..., N(µ̂k, σ′²)) to F.

5. Run MODIFIED SCHEFFE on F and output the resulting distribution.

Coarse single-linkage clustering: The second step is a single-linkage routine that clusters mixture components with far means. Single-linkage is a simple clustering scheme that starts out with each data point as a cluster, and at each step merges the two nearest clusters to form a larger cluster. The algorithm stops when the distance between clusters is larger than a pre-specified threshold.

Suppose the samples are generated by a one-dimensional mixture of k components that are far apart. Then with high probability, when the algorithm generates k clusters, all the samples within a cluster are generated by a single component. More precisely, if ∀i, j ∈ [k], |µi − µj| = Ω(σ√(log n)), then all the n samples concentrate around their respective means, and the separation between any two samples from different components is larger than the largest separation between any two samples from the same component. Hence, for a suitable value of the threshold, single-linkage correctly identifies the clusters. For d-dimensional Gaussian mixtures a similar property holds, with minimum separation Ω((d log(n/δ))^{1/4} σ). More precisely,

Lemma 8 (Appendix E.2). After step 2 of LEARN k-SPHERE, with probability ≥ 1 − 2δ, all samples from each component will be in the same cluster, and the maximum distance between two components within each cluster is ≤ 10kσ(d log(n²/δ))^{1/4}.

Recursive spectral-clustering: The clusters formed at the beginning of this step consist of components with mean separation O(σd^{1/4} log(n/δ)). We now recursively zoom into the clusters formed and show that it is possible to cluster the components with much smaller mean separation. We first find the largest eigenvector of

S(C) := (1/|C|) [Σ_{x∈C} (x − µ̂(C))(x − µ̂(C))ᵗ] − σ̂²I_d,

which is the sample covariance matrix with its diagonal term reduced by σ̂². Note that since the matrix is symmetric, the largest magnitude of an eigenvalue is the same as the spectral norm. We then project our samples onto this vector, and if there are two components with means far apart, we use single-linkage to divide the cluster into two. The following lemma shows that this step performs accurate clustering of components with well-separated means.

Lemma 9 (Appendix E.3). Let n ≥ c · (dk⁴/ε) log(n³/δ) for some constant c. After recursive clustering, with probability ≥ 1 − 4δ, the samples are divided into clusters such that for each component i within a cluster C, √wi ‖µi − µ(C)‖₂ ≤ 25σ√(k³ log(n³/δ)). Furthermore, all the samples from one component remain in a single cluster.

Exhaustive search and Scheffe: After step 3, all clusters have a small weighted radius √wi ‖µi − µ(C)‖₂ ≤ 25σ√(k³ log(n³/δ)). It can be shown that the eigenvectors give an accurate estimate of the span of µi − µ(C) within each cluster. More precisely,

Lemma 10 (Appendix E.4).
Let n\u2265 c\u22c5 dk9\n\u2265 1\u2212 7\u03b4, ifC\u2265 n\u0001~5k, then the projection of[\u00b5i\u2212 \u00b5(C)]~\u00b5i\u2212 \u00b5(C)2 on the space orthogonal\nto the span of top k\u2212 1 eigenvectors has magnitude\u2264\n\nExhaustive search and Scheffe: After step 3, all clusters have a small weighted radius\n\u03b4 . It can be shown that the eigenvectors give an accurate esti-\n\n\u03b4 for some constant c. After step 3, with probability\n\n\u00014 log2 d\n\nk3 log n3\n\n\u221a\n\n\u221a\n\nwi\u00b5i\u2212\u00b5(C) 2\n\n\u0001\u03c3\n\n.\n\n8\n\n2k\n\nWe now have accurate estimates of the spans of the cluster means and each cluster has components\nwith close means. It is now possible to grid the set of possibilities in each cluster to obtain a set of\ndistributions such that one of them is close to the underlying. There is a trade-off between a dense\ngrid to obtain a good estimation and the computation time required. The \ufb01nal step takes the sparsest\n\ngrid possible to ensure an error\u2264 \u0001. This is quantized below.\nTheorem 11 (Appendix E.5). Let n\u2265 c\u22c5 dk9\nSPHERE, with probability\u2265 1\u2212 9\u03b4, outputs a distribution \u02c6f such that D(\u02c6f , f)\u2264 1000\u0001. Furthermore,\n\u0002 k2\nthe algorithm runs in timeO\u0002n2d log n+ d\u0002 k7\n2\u0002.\n\n\u03b4 for some constant c. Then Algorithm LEARN k-\n\n\u00014 log2 d\n\n\u00013 log2 d\n\n\u03b4\n\nNote that the run time is calculated based on an ef\ufb01cient implementation of single-linkage clustering\nand the exponential term is not optimized.\n\ni\n\n4.3 Mixtures with unequal variances\n\ngorithm for learning mixtures with unequal variances are:\n\n1. In LEARN k-SPHERE, we \ufb01rst estimated the component variance \u03c3 and divided the samples\n\nWe generalize the results to mixtures with components having different variances. Let pi =\nN(\u00b5i, \u03c32\nId) be the ith component. 
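As an illustration of the recursive spectral-clustering step, the sketch below forms the adjusted covariance S(C), projects a cluster on its top eigenvector, and splits it with the one-dimensional single-linkage cut-off from Step 3. It is a hedged sketch under our own naming (`spectral_split`), not the authors' code.

```python
import numpy as np

def spectral_split(cluster, s2, k, n, delta):
    """Illustrative sketch (not the authors' code) of one round of the
    recursive spectral-clustering step: form S(C), project the samples
    on its top eigenvector, and split by single-linkage in the
    projected one-dimensional space."""
    d = cluster.shape[1]
    centered = cluster - cluster.mean(axis=0)
    # S(C): sample covariance with its diagonal reduced by sigma^2;
    # it is ~0 for a single spherical component, so a large spectral
    # norm indicates that several components share the cluster.
    S = centered.T @ centered / len(cluster) - s2 * np.eye(d)
    # Top eigenvector of the symmetric matrix S (largest |eigenvalue|).
    eigvals, eigvecs = np.linalg.eigh(S)
    v = eigvecs[:, np.argmax(np.abs(eigvals))]
    proj = centered @ v
    # 1-d single-linkage: cut at every gap wider than the Step 3
    # threshold 3 * sigma * sqrt(log(n^2 k / delta)).
    gap = 3 * np.sqrt(s2 * np.log(n ** 2 * k / delta))
    order = np.argsort(proj)
    cuts = np.where(np.diff(proj[order]) > gap)[0]
    return [cluster[idx] for idx in np.split(order, cuts + 1)]
```

For two components whose means differ well beyond the projected gap threshold, the cluster splits into two pure pieces; for a single spherical component, no gap exceeds the threshold and the cluster is returned whole.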
4.3 Mixtures with unequal variances

We generalize the results to mixtures whose components have different variances. Let pi = N(µi, σi² Id) be the ith component. The key differences between LEARN k-SPHERE and the algorithm for learning mixtures with unequal variances are:

1. In LEARN k-SPHERE, we first estimated the component variance σ and divided the samples into clusters such that within each cluster the means are separated by Õ(d^{1/4}σ). We modify this step so that within each cluster the components not only have mean separation O(d^{1/4}σ), but their variances are also at most a factor 1 + Õ(1/√d) apart.

2. Once the variances in each cluster are within a multiplicative factor 1 + Õ(1/√d) of each other, it can be shown that the performance of the recursive spectral-clustering step changes by at most constant factors.

3. After obtaining clusters with similar means and variances, the exhaustive search follows as before, though instead of a single σ′ for all clusters, we use a different σ′ for each cluster, estimated from the average pairwise distance between samples in the cluster.

The changes to the recursive clustering and exhaustive search steps are straightforward and we omit them. The coarse clustering step requires additional tools, which we describe in Appendix F.

5 Acknowledgements

We thank Sanjoy Dasgupta, Todd Kemp, and Krishnamurthy Vishwanathan for helpful discussions.
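For item 3 of Section 4.3, the per-cluster variance can be estimated from the average pairwise distance between samples, since E‖x − y‖² = 2dσ² for independent x, y drawn from the same spherical component. A minimal NumPy sketch (our own helper `cluster_variance`, not code from the paper):

```python
import numpy as np

def cluster_variance(cluster):
    """Hedged sketch of the per-cluster variance estimate mentioned in
    Section 4.3: for samples from N(mu, sigma^2 * I_d), the expected
    squared distance between two independent samples is 2*d*sigma^2,
    so the average pairwise squared distance divided by 2d estimates
    sigma^2."""
    m, d = cluster.shape
    diffs = cluster[:, None, :] - cluster[None, :, :]
    sq = (diffs ** 2).sum(axis=2)
    # Average over the m*(m-1) ordered pairs (the diagonal is zero).
    avg = sq.sum() / (m * (m - 1))
    return avg / (2 * d)
```

Unlike the closest-pair estimate of Step 1, the pairwise average concentrates tightly around the cluster's own σ², which is what allows a separate σ′ per cluster.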