{"title": "Using multiple samples to learn mixture models", "book": "Advances in Neural Information Processing Systems", "page_first": 324, "page_last": 332, "abstract": "In the mixture models problem it is assumed that there are $K$ distributions $\\theta_{1},\\ldots,\\theta_{K}$ and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same $K$ underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data.", "full_text": "Using multiple samples to learn mixture\n\nmodels\n\nStanford University\n\nJason Lee\u2217\nStanford, USA\n\njdl17@stanford.edu\n\nRan Gilad-Bachrach\n\nMicrosoft Research\n\nRedmond, USA\n\nrang@microsoft.com\n\nRich Caruana\nMicrosoft Research\n\nRedmond, USA\n\nrcaruana@microsoft.com\n\nAbstract\n\nIn the mixture models problem it is assumed that there are K distributions\n\u03b81, . . . , \u03b8K and one gets to observe a sample from a mixture of these distri-\nbutions with unknown coe\ufb03cients. The goal is to associate instances with\ntheir generating distributions, or to identify the parameters of the hidden\ndistributions. 
In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of the art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data.\n\n1 Introduction\n\nThe mixture model has been studied extensively from several directions. In one setting it is assumed that there is a single sample, that is, a single collection of instances, from which one has to recover the hidden information. A line of studies on clustering theory, starting from [5], has proposed to address this problem by finding a projection to a low-dimensional space and solving the problem in this space. The goal of this projection is to reduce the dimension while preserving, as much as possible, the distances between the means of the underlying distributions. We will refer to this line as MM (Mixture Models). On the other end of the spectrum, Topic Modeling (TM) [9, 3] assumes multiple samples (documents) that are mixtures, with different weights, of the underlying distributions (topics) over words.\n\nComparing the two lines presented above shows some similarities and some differences. Both models assume the same generative structure: a point (word) is generated by first choosing the distribution \u03b8i using the mixing weights and then selecting a point (word) according to this distribution. The goal of both models is to recover information about the generative model (see [10] for more on that). 
However, there are some key differences:\n\n(a) In MM, there exists a single sample to learn from. In TM, each document is a mixture of the topics, but with different mixture weights.\n\n(b) In MM, the points are represented as feature vectors while in TM the data is represented as a word-document co-occurrence matrix. As a consequence, the model generated by TM cannot assign words that did not previously appear in any document to topics.\n\n(c) TM assumes high density of the samples, i.e., that each word appears multiple times. However, if the topics were not discrete distributions, as is mostly the case in MM, each \"word\" (i.e., value) would typically appear either zero or one time, which makes the co-occurrence matrix useless.\n\n\u2217Work done while the author was an intern at Microsoft Research\n\nIn this work we try to close the gap between MM and TM. Similar to TM, we assume that multiple samples are available. However, we assume that points (words) are presented as feature vectors and the hidden distributions may be continuous. This allows us to solve problems that are typically hard in the MM model with greater ease, and to generate models that generalize to points not in the training data, which is something that TM cannot do.\n\n1.1 Definitions and Notations\n\nWe assume a mixture model in which there are K mixture components \u03b81, . . . , \u03b8K defined over the space X. These mixture components are probability measures over X. We assume that there are M mixture models (samples), each drawn with different mixture weights \u03a61, . . . , \u03a6M such that $\\Phi_j = (\\phi^j_1, \\ldots, \\phi^j_K)$ where all the weights are non-negative and sum to 1. Therefore, we have M different probability measures D1, . . . , DM defined over X such that for a measurable set A and j = 1, . . . , M we have $D_j(A) = \\sum_i \\phi^j_i \\theta_i(A)$. We will denote by \u03c6min the minimal value of $\\phi^j_i$.\n\nIn the first part of this work, we will provide an algorithm that, given samples S1, . . . , SM from the mixtures D1, . . . , DM, finds a low-dimensional embedding that preserves the distances between the means of each mixture.\n\nIn the second part of this work we will assume that the mixture components have disjoint supports. Hence we will assume that X = \u222ajCj such that the Cj's are disjoint and for every j, \u03b8j(Cj) = 1. Given samples S1, . . . , SM, we will provide an algorithm that finds the supports of the underlying distributions, and thus clusters the samples.\n\n1.2 Examples\n\nBefore we dive further into the discussion of our methods and how they compare to prior art, we would like to point out that the model we assume is realistic in many cases. Consider the following example: assume that one would like to cluster medical records to identify sub-types of diseases (e.g., different types of heart disease). In the classical clustering setting (MM), one would take a sample of patients and try to divide them based on some similarity criteria into groups. However, in many cases, one has access to data from different hospitals in different geographical locations. The communities being served by the different hospitals may differ in socioeconomic status, demographics, genetic backgrounds, and exposure to climate and environmental hazards. Therefore, different disease sub-types are likely to appear in different ratios in the different hospitals. However, if patients in two hospitals acquired the same sub-type of a disease, parts of their medical records will be similar.\n\nAnother example is object classification in images. Given an image, one may break it into patches, say of size 10x10 pixels. These patches may have different distributions based on the object in that part of the image. 
Therefore, patches from images taken at different locations will have different representations of the underlying distributions. Moreover, patches from the center of the frame are more likely to contain parts of faces than patches from the perimeter of the picture. At the same time, patches from the bottom of the picture are more likely to be of grass than patches from the top of the picture.\n\nIn the first part of this work we discuss the problem of identifying the mixture components from multiple samples when the means of the different components differ and the variances are bounded. We focus on the problem of finding a low-dimensional embedding of the data that preserves the distances between the means, since the problem of finding the mixtures in a low-dimensional space has already been addressed (see, for example, [10]). Next, we address a different case in which we assume that the supports of the hidden distributions are disjoint. We show that in this case we can identify the supports of each distribution. Finally, we demonstrate our approaches on toy problems. The proofs of the theorems and lemmas appear in the appendix. Table 1 summarizes the applicability of the algorithms presented here to the different scenarios.\n\n1.3 Comparison to prior art\n\nTable 1: Summary of the scenarios the MSP (Multi Sample Projection) algorithm and the DSC (Double Sample Clustering) algorithm are designed to address: DSC applies when the clusters are disjoint, in low or high dimension; MSP applies when the dimension is high; both apply to high-dimensional disjoint clusters.\n\nThere are two common approaches in the theoretical study of the MM model. The method of moments [6, 8, 1] allows the recovery of the model but requires exponential running time and sample sizes. The other approach, to which we compare our results, uses a two-stage approach. In the first stage, the data is projected to a low-dimensional space and in the second stage the association of points to clusters is recovered. Most of the results with this approach assume that the mixture components are Gaussians. Dasgupta [5], in a seminal paper, presented the first result in this line. He used random projections to project the points to a space of a lower dimension. This work assumes that the separation is at least $\\Omega(\\sigma_{\\max}\\sqrt{n})$. This result has been improved in a series of papers. Arora and Kannan [10] presented algorithms for finding the mixture components which are, in most cases, polynomial in n and K. Vempala and Wang [11] used PCA to reduce the required separation to $\\Omega(\\sigma_{\\max}K^{1/4}\\log^{1/4}(n/\\phi_{\\min}))$. They use PCA to project on the first K principal components; however, they require the Gaussians to be spherical. Kannan, Salmasian and Vempala [7] used similar spectral methods but were able to improve the results to require a separation of only $c\\sigma_{\\max}K^{3/2}/\\phi_{\\min}^{2}$. Chaudhuri and Rao [4] suggested using correlations and independence between features under the assumption that the means of the Gaussians differ on many features. They require a separation of $\\Omega(\\sigma_{\\max}\\sqrt{K\\log(K\\sigma_{\\max}\\log n/\\phi_{\\min})})$; however, they assume that the Gaussians are axis-aligned and that the distance between the centers of the Gaussians is spread across $\\Omega(K\\sigma_{\\max}\\log n/\\phi_{\\min})$ coordinates.\n\nWe present a method to project the problem into a space of dimension d\u2217, the dimension of the affine space spanned by the means of the distributions. We can find this projection and maintain the distances between the means to within a factor of $1-\\epsilon$. The different factors $\\sigma_{\\max}$, n and $\\epsilon$ will affect the sample size needed, but do not make the problem impossible. This can be used as a preprocessing step for any of the results discussed above. For example, combining with [5] yields an algorithm that requires a separation of only $\\Omega(\\sigma_{\\max}\\sqrt{d^*}) \\le \\Omega(\\sigma_{\\max}\\sqrt{K})$. However, using [11] will result in a separation requirement of $\\Omega(\\sigma_{\\max}\\sqrt{K\\log(K\\sigma_{\\max}\\log d^*/\\phi_{\\min})})$. There is also an improvement in terms of the value of $\\sigma_{\\max}$ since we need only to control the variance in the affine space spanned by the means of the Gaussians and do not need to restrict the variance in orthogonal directions, as long as it is finite. Later we also show that we can work in a more generic setting where the distributions are not restricted to be Gaussians, as long as the supports of the distributions are disjoint. While the disjointness assumption may seem too strict, we note that the results presented above make very similar assumptions. For example, even if the required separation is $\\sigma_{\\max}K^{1/2}$, then if we look at the Voronoi tessellation around the centers of the Gaussians, each cell will contain at least $1-(2\\pi)^{-1/2}K^{3/4}\\exp(-K/2)$ of the mass of the Gaussian. Therefore, when K is large, the supports of the Gaussians are almost disjoint.\n\n2 Projection for overlapping components\n\nIn this section we present a method that uses multiple samples to project high-dimensional mixtures to a low-dimensional space while keeping the means of the mixture components well separated.\n\nAlgorithm 1 Multi Sample Projection (MSP)\nInputs: samples S1, . . . , Sm from the mixtures D1, . . . , Dm\nOutputs: vectors $\\bar v_1, \\ldots, \\bar v_{m-1}$ which span the projected space\nAlgorithm:\n1. For j = 1, . . . , m let $\\bar E_j$ be the mean of the sample Sj\n2. For j = 1, . . . , m \u2212 1 let $\\bar v_j = \\bar E_j - \\bar E_{j+1}$\n3. Return $\\bar v_1, \\ldots, \\bar v_{m-1}$\n\nThe main idea behind the Multi Sample Projection (MSP) algorithm is simple. 
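As a concrete illustration, the two steps of the MSP algorithm can be sketched in a few lines of NumPy. This is our own sketch, not code from the paper: the name msp_projection and the QR-based projection step are our additions, and the two toy "mixtures" below are single Gaussians used only to show the shapes involved.

```python
import numpy as np

def msp_projection(samples):
    """Multi Sample Projection (Algorithm 1): estimate a spanning set for the
    affine span of the mixture means from the empirical sample means."""
    # Step 1: empirical mean of each sample S_j
    means = [S.mean(axis=0) for S in samples]
    # Step 2: difference vectors v_j = E_j - E_{j+1}
    return [means[j] - means[j + 1] for j in range(len(means) - 1)]

# Usage: project the data onto span{v_1, ..., v_{m-1}}.  An orthonormal
# basis (via QR) turns the projection into a single matrix product.
rng = np.random.default_rng(0)
S1 = rng.normal(loc=[5.0, 0.0, 0.0], size=(1000, 3))  # sample from mixture 1
S2 = rng.normal(loc=[0.0, 5.0, 0.0], size=(1000, 3))  # sample from mixture 2
vs = msp_projection([S1, S2])
Q, _ = np.linalg.qr(np.stack(vs, axis=1))  # columns form an orthonormal basis
projected = S1 @ Q  # dimension reduced from 3 to m - 1 = 1
```

Since the projection only uses sample means, its cost is linear in the total number of points, matching item 1 of the analysis below.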
Let \u00b5i be the mean of the i'th component \u03b8i and let Ej be the mean of the j'th mixture Dj. From the nature of the mixture, Ej is in the convex hull of \u00b51, . . . , \u00b5K and hence in the affine space spanned by them; this is demonstrated in Figure 1. Under mild assumptions, if we have sufficiently many mixtures, their means will span the affine space spanned by \u00b51, . . . , \u00b5K. Therefore, the MSP algorithm estimates the Ej's and projects onto the affine space they span. The reason for selecting this sub-space is that by projecting on this space we maintain the distance between the means while reducing the dimension to at most K \u2212 1. The MSP algorithm is presented in Algorithm 1. In the following theorem we prove the main properties of the MSP algorithm. We will assume that $X = \\mathbb{R}^n$, the first two moments of \u03b8j are finite, and $\\sigma^2_{\\max}$ denotes the maximal variance of any of the components in any direction. The separation of the mixture components is $\\min_{j \\ne j'} \\|\\mu_j - \\mu_{j'}\\|$. Finally, we will denote by d\u2217 the dimension of the affine space spanned by the \u00b5j's. Hence, d\u2217 \u2264 K \u2212 1.\n\nTheorem 1 (MSP Analysis). Let Ej = E[Dj], let vj = Ej \u2212 Ej+1, and let Nj = |Sj|. The following holds for MSP:\n\n1. The computational complexity of the MSP algorithm is $n\\sum_{j=1}^{M} N_j + 2n(m-1)$ where n is the original dimension of the problem.\n\n2. For any $\\epsilon > 0$, $\\Pr[\\sup_j \\|E_j - \\bar E_j\\| > \\epsilon] \\le \\frac{n\\sigma^2_{\\max}}{\\epsilon^2}\\sum_j \\frac{1}{N_j}$.\n\n3. Let $\\bar\\mu_i$ be the projection of \u00b5i on the space spanned by $\\bar v_1, \\ldots, \\bar v_{M-1}$ and assume that $\\forall i, \\mu_i \\in \\mathrm{span}\\{v_j\\}$. Let $\\alpha^i_j$ be such that $\\mu_i = \\sum_j \\alpha^i_j v_j$ and let $A = \\max_i \\sum_j |\\alpha^i_j|$. Then $\\Pr[\\max_{i,i'} |\\|\\mu_i - \\mu_{i'}\\| - \\|\\bar\\mu_i - \\bar\\mu_{i'}\\|| > \\epsilon] \\le \\frac{4n\\sigma^2_{\\max}A^2}{\\epsilon^2}\\sum_j \\frac{1}{N_j}$.\n\nThe MSP analysis theorem shows that with large enough samples, the projection will maintain the separation between the centers of the distributions. Moreover, since this is a projection, the variance in any direction cannot increase. The value of A measures the complexity of the setting. If the mixing coefficients are very different in the different samples then A will be small. However, if the mixing coefficients are very similar, a larger sample is required. Nevertheless, the size of the sample needed is polynomial in the parameters of the problem. It is also apparent that with large enough samples, a good projection will be found, even with large variances, high dimensions and close centroids.\n\nFigure 1: The mean of each mixture will be in the convex hull of the means of its components, demonstrated here by the red line.\n\nA nice property of the bounds presented here is that they assume only bounded first and second moments. Once a projection to a low-dimensional space has been found, it is possible to find the clusters using the approaches presented in Section 1.3. However, the analysis of the MSP algorithm assumes that the means E1, . . . , EM span the affine space spanned by \u00b51, . . . , \u00b5K. Clearly, this implies that we require m > d\u2217. However, when m is much larger than d\u2217, we might end up with a projection onto too large a space. This can easily be fixed since, in this case, $\\bar E_1, \\ldots, \\bar E_m$ will be almost co-planar in the sense that there will be an affine space of dimension d\u2217 that is very close to all these points and we can project onto this space.\n\n3 Disjoint supports and the Double Sample Clustering (DSC) algorithm\n\nIn this section we discuss the case where the underlying distributions have disjoint supports. In this case, we do not make any assumption about the distributions. For example, we do not require finite moments. However, as in the mixture-of-Gaussians case, some sort of separation between the distributions is needed; this is the role of the disjoint supports.\n\nWe will show that given two samples from mixtures with different mixture coefficients, it is possible to find the supports of the underlying distributions (clusters) by building a tree of classifiers such that each leaf represents a cluster. The tree is constructed in a greedy fashion. First we take the two samples, from the two distributions, and reweigh the examples such that the two samples will have the same cumulative weight. Next, we train a classifier to separate between the two samples. This classifier becomes the root of the tree. It also splits each of the samples into two sets. We take all the examples that the classifier assigns to the label +1 (\u22121), reweigh them and train another classifier to separate between the two samples. We keep going in the same fashion until we can no longer find a classifier that splits the data significantly better than random.\n\nTo understand why this algorithm works it is easier to look first at the case where the mixture distributions are known. If D1 and D2 are known, we can define the L1 distance between them as $L_1(D_1, D_2) = \\sup_A |D_1(A) - D_2(A)|$.1 It turns out that the supremum is attained by a set A such that for any i, \u03b8i(A) is either zero or one. 
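The greedy tree construction described above can be sketched in code. This is an illustrative recursion under our own naming (dsc, learn_stump), not the authors' implementation: a weighted decision stump stands in for the generic learning oracle, and a nested dictionary stands in for the tree of classifiers.

```python
import numpy as np

def learn_stump(X, y, w):
    """Learning oracle: weighted decision stump h(x) = s * sign(x[f] - t).
    Returns the stump parameters (f, t, s) and its weighted training error e."""
    best_err, best = 1.0, (0, 0.0, 1.0)
    total = w.sum()
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = np.where(X[:, f] > t, 1.0, -1.0)
            for s in (1.0, -1.0):
                err = w[s * pred != y].sum() / total
                if err < best_err:
                    best_err, best = err, (f, t, s)
    return best, best_err

def dsc(S1, S2, tau):
    """Double Sample Clustering sketch: recursively split until no classifier
    beats random guessing by more than tau; the leaves are the clusters."""
    if len(S1) == 0 or len(S2) == 0:
        return {"leaf": True}
    X = np.vstack([S1, S2])
    y = np.concatenate([np.ones(len(S1)), -np.ones(len(S2))])
    # reweigh so that both samples have the same cumulative weight
    w = np.concatenate([np.ones(len(S1)),
                        np.full(len(S2), len(S1) / len(S2))])
    (f, t, s), e = learn_stump(X, y, w)
    if e >= 0.5 - tau:  # samples are indistinguishable here: a single cluster
        return {"leaf": True}
    h = lambda S: s * np.where(S[:, f] > t, 1.0, -1.0)
    return {"leaf": False, "split": (f, t, s),
            "plus": dsc(S1[h(S1) > 0], S2[h(S2) > 0], tau),
            "minus": dsc(S1[h(S1) < 0], S2[h(S2) < 0], tau)}
```

On two 1-D samples that mix the same pair of well-separated clusters with different weights, the root split lands between the clusters, mirroring the argument that follows.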
Therefore, any inner node in the tree splits the region without breaking clusters. This process proceeds until all the points associated with a leaf are from the same cluster, in which case no classifier can distinguish between the classes.\n\nWhen working with samples, we have to tolerate some error and prevent overfitting. One way to see that is to look at the problem of approximating the L1 distance between D1 and D2 using the samples S1 and S2. One possible way to do that is to define $\\hat L_1 = \\sup_A \\left|\\frac{|A \\cap S_1|}{|S_1|} - \\frac{|A \\cap S_2|}{|S_2|}\\right|$. However, this estimate is almost surely going to be 1 if the underlying distributions are absolutely continuous. Therefore, one has to restrict the class from which A can be selected to a class of VC dimension small enough compared to the sizes of the samples. We claim that asymptotically, as the sizes of the samples increase, one can increase the complexity of the class until the clusters can be separated.\n\nBefore we proceed, we recall a result of [2] that shows the relation between classification and the L1 distance. We will abuse the notation and treat A both as a subset and as a classifier. If we mix D1 and D2 with equal weights then\n\n$\\mathrm{err}(A) = D_1(X \\setminus A) + D_2(A) = 1 - D_1(A) + D_2(A) = 1 - (D_1(A) - D_2(A))$.\n\nTherefore, minimizing the error is equivalent to maximizing the L1 distance.\n\n1 The supremum is over all the measurable sets.\n\nAlgorithm 2 Double Sample Clustering (DSC)\nInputs:\n\u2022 Samples S1, S2\n\u2022 A binary learning algorithm L that given samples S1, S2 with weights w1, w2 finds a classifier h and an estimator e of the error of h\n\u2022 A threshold \u03c4 > 0\nOutputs:\n\u2022 A tree of classifiers\nAlgorithm:\n1. Let w1 = 1 and w2 = |S1|/|S2|\n2. Apply L to S1 and S2 with weights w1 and w2 to get the classifier h and the estimator e.\n3. If e \u2265 1/2 \u2212 \u03c4, return a tree with a single leaf.\n4. Else:\n(a) For j = 1, 2, let $S^+_j = \\{x \\in S_j : h(x) > 0\\}$\n(b) For j = 1, 2, let $S^-_j = \\{x \\in S_j : h(x) < 0\\}$\n(c) Let $T^+$ be the tree returned by the DSC algorithm applied to $S^+_1$ and $S^+_2$\n(d) Let $T^-$ be the tree returned by the DSC algorithm applied to $S^-_1$ and $S^-_2$\n(e) Return a tree in which h is at the root node, $T^-$ is its left subtree and $T^+$ is its right subtree\n\nThe key observation for the DSC algorithm is that if $\\phi^1_i \\ne \\phi^2_i$, then a set A that maximizes the L1 distance between D1 and D2 is aligned with cluster boundaries (up to a measure zero). Furthermore, A contains all the clusters for which $\\phi^1_i > \\phi^2_i$ and does not contain the clusters for which $\\phi^1_i < \\phi^2_i$. Hence, if we split the space into A and $\\bar A$ we have few clusters on each side. By applying the same trick recursively on each side we keep on bisecting the space according to cluster boundaries until subspaces that contain only a single cluster remain. These sub-spaces cannot be further separated and hence the algorithm will stop. Figure 2 demonstrates this idea. The following lemma states this argument mathematically:\n\nLemma 1. If $D_j = \\sum_i \\phi^j_i \\theta_i$ then\n1. $L_1(D_1, D_2) \\le \\sum_i \\max(\\phi^1_i - \\phi^2_i, 0)$.\n2. If $A^* = \\cup_{i:\\phi^1_i > \\phi^2_i} C_i$ then $D_1(A^*) - D_2(A^*) = \\sum_i \\max(\\phi^1_i - \\phi^2_i, 0)$.\n3. If $\\forall i, \\phi^1_i \\ne \\phi^2_i$ and A is such that $D_1(A) - D_2(A) = L_1(D_1, D_2)$ then $\\forall i, \\theta_i(A \\triangle A^*) = 0$.\n\nFigure 2: Demonstration of the DSC algorithm. 
Assume that \u03a61 = (0.4, 0.3, 0.3) for the orange, green and blue regions respectively and \u03a62 = (0.5, 0.1, 0.4). The green region maximizes the L1 distance and therefore will be separated from the blue and orange. Conditioned on these two regions, the mixture coefficients are \u03a61|orange, blue = (4/7, 3/7) and \u03a62|orange, blue = (5/9, 4/9). The region that maximizes this conditional L1 is the orange region, which will be separated from the blue.\n\nWe conclude from Lemma 1 that if D1 and D2 were explicitly known and one could have found a classifier that best separates between the distributions, that classifier would not break clusters as long as the mixing coefficients are not identical. In order for this to hold when the separation is applied recursively in the DSC algorithm it suffices to have that for every $I \\subseteq [1, \\ldots, K]$, if $|I| > 1$ and $i \\in I$ then\n\n$\\frac{\\phi^1_i}{\\sum_{i' \\in I}\\phi^1_{i'}} \\ne \\frac{\\phi^2_i}{\\sum_{i' \\in I}\\phi^2_{i'}}$\n\nto guarantee that at any stage of the algorithm clusters will not be split by the classifier (except possibly by sections of measure zero). This is also sufficient to guarantee that the leaves will contain single clusters.\n\nIn the case where data is provided through a finite sample, some book-keeping is needed. However, the analysis follows the same path. We show that with samples large enough, clusters are only minimally broken. For this to hold we require that the learning algorithm L separates the clusters according to this definition:\n\nDefinition 1. For $I \\subseteq [1, \\ldots, K]$ let $c_I : X \\to \\{\\pm 1\\}$ be such that $c_I(x) = 1$ if $x \\in \\cup_{i \\in I} C_i$ and $c_I(x) = -1$ otherwise. A learning algorithm L separates C1, . . . , CK if for every \u03b5, \u03b4 > 0 there exists N such that for every n > N and every measure \u03bd over $X \\times \\{\\pm 1\\}$, with probability 1 \u2212 \u03b4 over samples from $\\nu^n$:\n\n1. The algorithm L returns a hypothesis $h : X \\to \\{\\pm 1\\}$ and an error estimator $e \\in [0, 1]$ such that $|\\Pr_{x,y \\sim \\nu}[h(x) \\ne y] - e| \\le \\epsilon$.\n\n2. h is such that $\\forall I, \\Pr_{x,y \\sim \\nu}[h(x) \\ne y] < \\Pr_{x,y \\sim \\nu}[c_I(x) \\ne y] + \\epsilon$.\n\nBefore we introduce the main statement, we define what it means for a tree to cluster the mixture components:\n\nDefinition 2. A clustering tree is a tree in which each internal node is a classifier and the points that end in a certain leaf are considered a cluster. A clustering tree \u03b5-clusters the mixture components \u03b81, . . . , \u03b8K if for every i \u2208 1, . . . , K there exists a leaf in the tree such that the cluster $L \\subseteq X$ associated with this leaf satisfies $\\theta_i(L) \\ge 1 - \\epsilon$ and $\\theta_{i'}(L) < \\epsilon$ for every i' \u2260 i.\n\nTo be able to find a clustering tree, the two mixtures have to be different. The following definition captures the gap, which is the amount of difference between the mixtures.\n\nDefinition 3. Let \u03a61 and \u03a62 be two mixture vectors. The gap, g, between them is\n\n$g = \\min\\left\\{\\left|\\frac{\\phi^1_i}{\\sum_{i' \\in I}\\phi^1_{i'}} - \\frac{\\phi^2_i}{\\sum_{i' \\in I}\\phi^2_{i'}}\\right| : I \\subseteq [1, \\ldots, K], |I| > 1, i \\in I\\right\\}$.\n\nWe say that \u03a6 is bounded away from zero by b if $b \\le \\min_i \\phi_i$.\n\nTheorem 2. Assume that L separates \u03b81, . . . , \u03b8K, that there is a gap g > 0 between \u03a61 and \u03a62, and that both \u03a61 and \u03a62 are bounded away from zero by b > 0. For every \u03b5\u2217, \u03b4\u2217 > 0 there exists N = N(\u03b5\u2217, \u03b4\u2217, g, b, K) such that given two random samples of sizes n1, n2 > N from the two mixtures, with probability of at least 1 \u2212 \u03b4\u2217 the DSC algorithm will return a clustering tree which \u03b5\u2217-clusters \u03b81, . . . , \u03b8K when applied with the threshold \u03c4 = g/8.\n\n4 Empirical evidence\n\nWe conducted several experiments with synthetic data to compare different methods when clustering in high-dimensional spaces. The synthetic data was generated from three Gaussians with centers at the points (0, 0), (3, 0) and (\u22123, +3). On top of that, we added additional dimensions with normally distributed noise. In the first experiment we used unit variance for all dimensions. In the second experiment we skewed the distribution so that the variance in the other features is 5.\n\nTwo sets of mixing coefficients for the three Gaussians were chosen at random 100 times by selecting three uniform values from [0, 1] and normalizing them to sum to 1. We generated two samples with 80 examples each from the two mixing coefficients. The DSC and MSP algorithms received these two samples as inputs while the reference algorithms, which are not designed to use multiple samples, received the combined set of 160 points as input.\n\nFigure 3: Comparison of the different algorithms: (a) accuracy with spherical Gaussians; (b) average accuracy with skewed Gaussians. The dimension of the problem is presented on the X axis and the accuracy on the Y axis.\n\nWe ran 100 trials. In each trial, each of the algorithms finds 3 Gaussians. We then measure the percentage of the points associated with the true originating Gaussian after making the best assignment of the inferred centers to the true Gaussians.\n\nWe compared several algorithms. K-means was used on the data as a baseline. We compared three low-dimensional projection algorithms. Following [5] we used random projections as the first of these. Second, following [11] we used PCA to project on the maximal variance subspace. MSP was used as the third projection algorithm. 
In all projection algorithms we first projected on a one-dimensional space and then applied K-means to find the clusters. Finally, we used the DSC algorithm. The DSC algorithm uses the classregtree function in MATLAB as its learning oracle. Whenever K-means was applied, the MATLAB implementation of this procedure was used with 10 random initial starts.\n\nFigure 3(a) shows the results of the first experiment with unit variance in the noise dimensions. In this setting, the Maximal Variance method is expected to work well since the first two dimensions have larger expected variance. Indeed we see that this is the case. However, when the number of dimensions is large, MSP and DSC outperform the other methods; this corresponds to the difficult regime of low signal-to-noise ratio. In 12800 dimensions, MSP outperforms Random Projections 90% of the time, Maximal Variance 80% of the time, and K-means 79% of the time. DSC outperforms Random Projections, Maximal Variance and K-means 84%, 69%, and 66% of the time respectively. Thus the p-value in all these experiments is < 0.01.\n\nFigure 3(b) shows the results of the experiment in which the variance in the noise dimensions is higher, which creates a more challenging problem. In this case, we see that all the reference methods suffer significantly, but the MSP and the DSC methods obtain similar results as in the previous setting. Both the MSP and the DSC algorithms win over Random Projections, Maximal Variance and K-means more than 78% of the time when the dimension is 400 and up. The p-value of these experiments is < 1.6 \u00d7 10\u22127.\n\n5 Conclusions\n\nThe mixture problem examined here is closely related to the problem of clustering. Most clustering data can be viewed as points generated from multiple underlying distributions or generating functions, and clustering can be seen as the process of recovering the structure of or assignments to these distributions. 
We presented two algorithms for the mixture problem that can be viewed as clustering algorithms. The MSP algorithm uses multiple samples to find a low-dimensional space to project the data to. The DSC algorithm builds a clustering tree assuming that the clusters are disjoint. We proved that these algorithms work under milder assumptions than currently known methods. The key message in this work is that when multiple samples are available, often it is best not to pool the data into one large sample, but that the structure in the different samples can be leveraged to improve clustering power.\n\nReferences\n\n[1] Mikhail Belkin and Kaushik Sinha, Polynomial learning of distribution families, Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, IEEE, 2010, pp. 103\u2013112.\n\n[2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira, Analysis of representations for domain adaptation, Advances in Neural Information Processing Systems 19 (2007), 137.\n\n[3] David M Blei, Andrew Y Ng, and Michael I Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993\u20131022.\n\n[4] Kamalika Chaudhuri and Satish Rao, Learning mixtures of product distributions using correlations and independence, Proc. of COLT, 2008.\n\n[5] Sanjoy Dasgupta, Learning mixtures of Gaussians, Foundations of Computer Science, 1999. 40th Annual Symposium on, IEEE, 1999, pp. 634\u2013644.\n\n[6] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant, Efficiently learning mixtures of two Gaussians, Proceedings of the 42nd ACM Symposium on Theory of Computing, ACM, 2010, pp. 553\u2013562.\n\n[7] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala, The spectral method for general mixture models, Learning Theory, Springer, 2005, pp. 
444\u2013457.\n\n[8] Ankur Moitra and Gregory Valiant, Settling the polynomial learnability of mixtures of Gaussians, Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, IEEE, 2010, pp. 93\u2013102.\n\n[9] Christos H Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala, Latent semantic indexing: A probabilistic analysis, Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM, 1998, pp. 159\u2013168.\n\n[10] Sanjeev Arora and Ravi Kannan, Learning mixtures of arbitrary Gaussians, Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM, 2001, pp. 247\u2013257.\n\n[11] Santosh Vempala and Grant Wang, A spectral algorithm for learning mixtures of distributions, Foundations of Computer Science, 2002. Proceedings. The 43rd Annual IEEE Symposium on, IEEE, 2002, pp. 113\u2013122.\n", "award": [], "sourceid": 241, "authors": [{"given_name": "Jason", "family_name": "Lee", "institution": "Stanford University"}, {"given_name": "Ran", "family_name": "Gilad-Bachrach", "institution": "Microsoft Research"}, {"given_name": "Rich", "family_name": "Caruana", "institution": "Microsoft Research"}]}