{"title": "Learning Mixture of Gaussians with Streaming Data", "book": "Advances in Neural Information Processing Systems", "page_first": 6605, "page_last": 6614, "abstract": "In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of $N$ points in $d$ dimensions generated by an unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers is at least $C\\sigma$ apart, where $C=\\Omega((k\\log k)^{1/4})$ and $\\sigma^2$ is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to certain constants); our center separation requirement matches the best known result for spherical Gaussians \\citep{vempalawang}. For finite samples, we show that a bias term based on the initial estimate decreases at an $O(1/{\\rm poly}(N))$ rate while the variance decreases at the nearly optimal rate of $\\sigma^2 d/N$. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers, for which we provide an online PCA based clustering algorithm. Indeed, the asymptotic per-step time complexity of our algorithm is the optimal $d\\cdot k$, while the space complexity of our algorithm is $O(dk\\log k)$. In addition to the bias and variance terms which tend to $0$, the hard-thresholding based updates of streaming Lloyd's algorithm are agnostic to the data distribution and hence incur an \\emph{approximation error} that cannot be avoided. 
However, by using a streaming version of the classical \\emph{(soft-thresholding-based)} EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at a nearly optimal rate, and tending to $0$ for $N\\rightarrow \\infty$.", "full_text": "Learning Mixture of Gaussians with Streaming Data

Aditi Raghunathan
Stanford University
aditir@stanford.edu

Prateek Jain
Microsoft Research, India
prajain@microsoft.com

Ravishankar Krishnaswamy
Microsoft Research, India
rakri@microsoft.com

Abstract

In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of N points in d dimensions generated by an unknown mixture of k spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. 
Assuming each pair of centers is at least Cσ apart, where C = Ω((k log k)^{1/4}) and σ² is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to certain constants); our center separation requirement matches the best known result for spherical Gaussians [18]. For finite samples, we show that a bias term based on the initial estimate decreases at an O(1/poly(N)) rate while the variance decreases at the nearly optimal rate of σ²d/N. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers, for which we provide an online PCA based clustering algorithm. Indeed, the asymptotic per-step time complexity of our algorithm is the optimal d·k, while the space complexity of our algorithm is O(dk log k).
In addition to the bias and variance terms which tend to 0, the hard-thresholding based updates of streaming Lloyd's algorithm are agnostic to the data distribution and hence incur an approximation error that cannot be avoided. However, by using a streaming version of the classical (soft-thresholding-based) EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at a nearly optimal rate, and tending to 0 as N → ∞.

1 Introduction

Clustering data into homogeneous clusters is a critical first step in any data analysis/exploration task and is used extensively to pre-process data, form features, remove outliers and visualize data. Due to the explosion in the amount of data collected and processed, designing clustering algorithms that can handle large datasets that do not fit in RAM is paramount to any big-data system. 
A common approach in such scenarios is to treat the entire dataset as a stream of data, and then design algorithms which update the model after every few points from the data stream. In addition, there are several practical applications where the data itself is not available beforehand and is streaming in, for example in any typical online system like web-search.
For such a model, the algorithm of choice in practice is the so-called streaming k-means heuristic. It is essentially a streaming version of the celebrated k-means algorithm or Lloyd's heuristic [8]. The basic k-means algorithm is designed for offline/batch data: each data point is assigned to the nearest centroid and the centroids are then updated based on the assigned points; this process is iterated till the solution is locally optimal. The streaming version of the k-means algorithm assigns the new point from the stream to the closest centroid and updates this centroid immediately. That is, unlike offline k-means, which first assigns all the points to their respective centroids and then updates the centroids, the streaming algorithm updates the centroids after each point, making it much more space efficient. While streaming k-means and its several variants are used heavily in practice, their properties, such as solution quality and time to convergence, have not been widely studied. In this paper, we attempt to provide such a theoretical study of the streaming k-means heuristic. One of the big challenges is that even the (offline) k-means algorithm attempts to solve a non-convex NP-hard problem.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
 
Streaming data poses additional challenges because of the large noise in each point, which can perturb the solution significantly.
In the offline setting, clustering algorithms are typically studied under certain simplifying assumptions that help bypass the worst-case NP-hardness of these problems. One of the most widely studied settings is when the data is sampled from a mixture of well-separated Gaussians [5, 18, 1], which is also the generative assumption that we impose on the data in this work. However, the online/streaming version of the k-means algorithm has not been studied in such settings. In this work, we design and study a variant of the popular online k-means algorithm where the data is streaming in, we cannot store more than logarithmically many data points, and each data point is sampled from a mixture of well-separated spherical Gaussians. The goal of the algorithm is then to learn the means of each of the Gaussians; note that estimating other parameters like the variance and the weight of each Gaussian in the mixture becomes simple once the true means are estimated accurately.
Our Results. Our main contribution is the first bias-variance bound for the problem of learning Gaussian mixtures with streaming data. Assuming that the centers are separated by Cσ where C = Ω(√(log k)), and if we seed the algorithm with initial cluster centers that are ≤ Cσ/20 distance away from the true centers, then we show that the error in estimating the true centers can be decomposed into three terms and bound each one of them: (a) the bias term, i.e., the term dependent on the distance of the true means to the initial centers, decreases at a 1/poly(N) rate, where N is the number of data points observed so far; (b) the variance term is bounded by σ²(d log N / N), where σ is the standard deviation of each of the Gaussians and d is the dimensionality of the data; and (c) an offline approximation error: indeed, note that even the offline Lloyd's heuristic will have an approximation error due to its hard-thresholding nature. For example, even when k = 2 and the centers are separated by Cσ, around an exp(−C²/8) fraction of points from the first Gaussian will be closer to the second center, and so the k-means heuristic will converge to centers that are at a squared distance of roughly O(C²) exp(−C²/8)σ² from the true means. We essentially inherit this offline error up to constants.
Note that the above result holds at a center separation of Ω(√(log k))σ, which is substantially weaker than the currently best-known result of Ω(σk^{1/4}) for even the offline problem [18]. However, as mentioned before, this only holds provided we have a good initialization. 
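As a quick numerical sanity check on this offline error term (the code and function name below are ours, not from the paper): for two spherical Gaussians whose centers are Cσ apart, a sample from the first component is closer to the second center exactly when its standard-normal projection along the line joining the centers exceeds C/2, an event of probability Φ(−C/2) ≤ (1/2) exp(−C²/8):

```python
import math

def misclassified_fraction(C):
    """Fraction of points from N(mu1, sigma^2 I) that land closer to mu2
    when ||mu1 - mu2|| = C * sigma.  Only the 1-D projection onto the line
    joining the two centers matters, so this equals Phi(-C/2)."""
    return 0.5 * math.erfc(C / (2 * math.sqrt(2)))  # Phi(-C/2)

for C in (2.0, 4.0, 6.0):
    print(f"C = {C}: overlap {misclassified_fraction(C):.2e}, "
          f"bound exp(-C^2/8) = {math.exp(-C * C / 8):.2e}")
```

Already at C = 4 the overlap is only about 2.3% of the mass, yet it is precisely this mass that shifts the hard-thresholding fixed point away from the true means.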
To this end, we show that when C = Ω((k log k)^{1/4}), we can combine an online PCA algorithm [9, 11] with the batch k-means algorithm on a small seed sample of around O(k log k) points to get such an initialization. Note that this separation requirement nearly matches the best-known offline results [18].
Finally, we also study a soft version of the streaming k-means algorithm, which can also be viewed as a streaming version of the popular Expectation Maximization (EM) algorithm. We show that for a mixture of two well-separated Gaussians, a variant of the streaming EM algorithm recovers the above-mentioned bias-variance bound but without the approximation error. That is, after observing infinitely many samples, streaming EM converges to the true means and matches the corresponding offline results in [3, 6]; to the best of our knowledge this is also the first such consistency result for the streaming mixture problem. However, the EM updates require that the data is sampled from a mixture of Gaussians, while the updates of the streaming Lloyd's algorithm are agnostic to the data distribution, and hence the same updates can be used to solve arbitrary mixtures of sub-Gaussians as well.
Technical Challenges. One key technical challenge in analyzing the streaming k-means algorithm, in comparison to standard streaming regression-style problems, is that the offline problem itself is non-convex and moreover can only be solved approximately. Hence, a careful analysis is required to separate out the error we get in each iteration into bias, variance, and inherent approximation error terms. Moreover, due to the non-convexity, we are able to guarantee a decrease in error only if each of our iterates lies in a small ball around the true mean. While this is initially true due to the initialization algorithm, our intermediate centers might escape these balls during our update. 
However, we show using a delicate martingale-based argument that with high probability, our estimates stay within slightly larger balls around the true means, which turns out to be sufficient for us.

Related Work. A closely related work to ours is an independent work by [17] which studies a stochastic version of k-means for data points that satisfy a spectral variance condition, which can be seen as a deterministic version of the mixture-of-distributions assumption. However, their method requires multiple passes over the data, and thus doesn't fit directly in the streaming k-means setting. In particular, the above-mentioned paper analyzes the stochastic k-means method only for a highly accurate initial set of iterates, which requires a large burn-in period of t = O(N²) and hence needs O(N) passes over the data, where N is the number of data points. Tensor methods [1, 10] can also be extended to cluster streaming data points sampled from a mixture distribution, but these methods suffer from large sample/time complexity and might not provide reasonable results when the data distribution deviates from the assumed generative model.
In addition to the Gaussian mixture model, clustering problems are also studied under other models, such as data with small spectral variance [12], stability of data [4], etc. It would be interesting to study the streaming versions in such models as well.
Paper Outline. We describe our model and problem setup in Section 2. We then present our streaming k-means algorithm and its proof overview in Sections 3 and 4. We then discuss the initialization procedure in Section 5. Finally, we describe our streaming-EM algorithm in Section 6.

2 Setup and Notation

We assume that the data is drawn from a mixture of k spherical Gaussian distributions, i.e.,

    x^t ~ i.i.d. ~ Σ_i w_i N(µ*_i, σ²I),   µ*_i ∈ R^d   ∀ i = 1, 2, . . . , k,    (1)

where µ*_i ∈ R^d is the mean of the i-th mixture component, and the mixture weights satisfy w_i ≥ 0 and Σ_i w_i = 1. All the problem parameters (i.e., the true means, the variance σ² and the mixture weights) are unknown to the algorithm. Using the standard streaming setup, where the t-th sample x^t ∈ R^d is drawn from the data distribution, our goal is to produce an estimate µ̂_i of µ*_i for i = 1, 2, . . . , k in a single pass over the data using bounded space.
Center Separation. A suitable notion of signal-to-noise ratio for our problem turns out to be the ratio of the minimum separation between the true centers to the maximum standard deviation along any direction. We denote C_ij = ‖µ*_i − µ*_j‖/σ and C = min_{i,j} C_ij. Here and in the rest of the paper, ‖y‖ is the Euclidean norm of a vector y. We use η to denote the learning rate of the streaming updates and µ^t_i to denote the estimate of µ*_i at time t.
Remarks. For a cleaner presentation, we assume that all the mixture weights are 1/k, but our results hold with general weights as long as an appropriate center separation condition is satisfied. Secondly, our proofs also go through when the Gaussians have different variances σ_i², as long as the separation conditions are satisfied with σ = max_i σ_i. We furnish details in the full version of this paper [14].

3 Algorithm and Main Result

In this section, we describe our proposed streaming clustering algorithm and present our analysis of the algorithm. At a high level, we follow the approach of various recent results for (offline) mixture recovery algorithms [18, 12]. 
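As an aside, model (1) is straightforward to simulate, which is convenient for sanity-checking the streaming procedures that follow; the sketch below is our own illustrative code (all names are ours), drawing a stream from a k = 2, d = 2 instance:

```python
import random

def gmm_stream(centers, sigma, weights, n, seed=0):
    """Yield n i.i.d. samples from sum_i w_i N(mu_i, sigma^2 I), i.e. model (1)."""
    rng = random.Random(seed)
    k = len(centers)
    for _ in range(n):
        i = rng.choices(range(k), weights=weights)[0]  # pick a component
        yield [m + sigma * rng.gauss(0.0, 1.0) for m in centers[i]]

centers = [[0.0, 0.0], [10.0, 0.0]]                    # separation C = 10 for sigma = 1
stream = list(gmm_stream(centers, sigma=1.0, weights=[0.5, 0.5], n=4000))
mean = [sum(x[j] for x in stream) / len(stream) for j in range(2)]
print(mean)  # close to the mixture mean (5, 0)
```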
That is, we initialize the algorithm with an SVD-style operation which de-noises the data significantly (Algorithm 1), and then apply our streaming version of Lloyd's heuristic (Algorithm 2). Note that Lloyd's algorithm is agnostic to the underlying distribution and does not include distribution-specific terms like the variance, etc.
Intuitively, the initialization algorithm first computes an online batch PCA in the for-loop. After this step, we perform an offline distance-based clustering on the projected subspace (akin to Vempala-Wang for the offline algorithm). Note that since we only need estimates for centers within a suitable proximity of the true centers, this step only uses a few (roughly k log k) samples. These centers are fed as the initial centers to the streaming update algorithm. The streaming algorithm then, for each new sample, updates the current center which is closest to the sample, and iterates.

Figure 1: Illustration of optimal K-means error

Algorithm 1 InitAlg(N0)
  U ← random orthonormal matrix ∈ R^{d×k}
  B = Θ(d log d), S = 0
  for t = 1 to N0 − k log k do
    if mod(t, B) = 0 then
      U ← QR(S · U), S ← 0
    end if
    Receive x^t as generated by the input stream
    S = S + x^t (x^t)^T
  end for
  X_0 = [x^{N0 − k log k + 1}, . . . , x^{N0}]
  Form the nearest-neighbor graph using U^T X_0 and find its connected components
  [ν^0_1, . . . , ν^0_k] ← mean of the points in each component
  Return: [µ^0_1, . . . , µ^0_k] = [Uν^0_1, . . . , Uν^0_k]

Algorithm 2 StreamKmeans(N, N0)
  1: Set η ← (3k log 3N)/N.
  2: Set {µ^0_1, . . . , µ^0_k} ← InitAlg(N0).
  3: for t = 1 to N do
  4:   Receive x^{t+N0} given by the input stream
  5:   x = x^{t+N0}
  6:   Let i_t = arg min_i ‖x − µ^{t−1}_i‖.
  7:   Set µ^t_{i_t} = (1 − η)µ^{t−1}_{i_t} + ηx
  8:   Set µ^t_i = µ^{t−1}_i for i ≠ i_t
  9: end for
  10: Output: µ^N_1, . . . , µ^N_k

We now present our main result for the streaming clustering problem.
Theorem 1. Let x^t, 1 ≤ t ≤ N + N0 be generated using a mixture of Gaussians (1) with w_i = 1/k, ∀i. Let N0, N ≥ O(1)k³d³ log d and C ≥ Ω((k log k)^{1/4}). Then, the mean estimates (µ^N_1, . . . , µ^N_k) output by Algorithm 2 satisfy the following error bound:

    E[ Σ_i ‖µ^N_i − µ*_i‖² ] ≤ max_i ‖µ*_i‖² / N^{Ω(1)}  [bias]  +  O(k³) ( σ² (d log N)/N  [variance]  +  exp(−C²/8)(C² + k)σ²  [≈ offline k-means error] ).

Our error bound consists of three key terms: bias, variance, and offline k-means error, with bias and variance being standard statistical error terms: (i) the bias depends on the initial estimation error and goes down at a 1/N^ζ rate, where ζ > 1 is a large constant; (ii) the variance error is the error due to the noise in each observation x^t and goes down at the nearly optimal rate of ≈ σ²d/N, albeit with an extra log N term as well as a worse dependence on k; and (iii) the offline k-means error is the error that even the offline Lloyd's algorithm would incur for a given center separation C. 
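The streaming updates in Steps 3-8 of Algorithm 2 are only a few lines of code. The sketch below is our own illustration (with a synthetic stream and an initialization placed near the true means, as the theorem assumes) of the hard-thresholding update µ_{i_t} ← (1 − η)µ_{i_t} + ηx, using the learning rate of Step 1:

```python
import math
import random

def stream_kmeans(stream, init_centers, eta):
    """Steps 3-8 of Algorithm 2: assign each incoming point to its nearest
    current center and move only that center by a (1 - eta, eta) mixture."""
    mus = [list(mu) for mu in init_centers]
    for x in stream:
        i = min(range(len(mus)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, mus[j])))
        mus[i] = [(1 - eta) * a + eta * b for a, b in zip(mus[i], x)]
    return mus

rng = random.Random(1)
true = [[0.0, 0.0], [10.0, 0.0]]                 # well-separated true means
N, k = 20000, 2
stream = [[m + rng.gauss(0.0, 1.0) for m in true[rng.randrange(k)]]
          for _ in range(N)]
eta = 3 * k * math.log(3 * N) / N                # Step 1 of Algorithm 2
mus = stream_kmeans(stream, [[1.0, 1.0], [9.0, 1.0]], eta)
```

On this toy instance the final centers land close to the true means; at such a large separation the offline approximation error exp(−C²/8)(C² + k)σ² is negligible, so the residual error is dominated by the variance term.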
Note that while sampling from the mixture distribution, an ≈ exp(−C²/8) fraction of the data points can be closer to the true means of other clusters than to their own mean, because the tails of the distributions overlap. Hence, in general it is not possible to assign these points back to the correct cluster without any modeling assumptions. These misclassified points will shift the estimated centers along the line joining the means; see Figure 1 for an illustration. This error can however be avoided by performing soft updates, which is discussed in Section 6.
Time, space, and sample complexity: Our algorithm has a nearly optimal time complexity of O(d·k) per iteration; the initialization algorithm requires about O(d⁴k³) time. The space complexity of our algorithm is O(dk·log k), which is also nearly optimal. Finally, the sample complexity is O(d³k³), which is a loose upper bound and can be significantly improved by a more careful analysis. For comparison, the best known sample complexity for the offline setting is Õ(kd) [2], which is better by a factor of (dk)².

Analysis Overview. The proof of Theorem 1 essentially follows from the two theorems stated below: a) an analysis of the streaming updates given a good initialization; b) an analysis of InitAlg showing that it provides such an initialization.
Theorem 2 (Streaming Update). Let x^t, N0 + 1 ≤ t ≤ N + N0 be generated using a mixture of Gaussians (1) with w_i = 1/k, ∀i, and N = Ω(k³d³ log kd). Also, let the center separation satisfy C ≥ Ω(√(log k)), and suppose our initial centers µ^0_i are such that for all 1 ≤ i ≤ k, ‖µ^0_i − µ*_i‖ ≤ Cσ/20. Then, the streaming updates of StreamKmeans(N, N0), i.e., Steps 3-8 of Algorithm 2, satisfy:

    E[ Σ_i ‖µ^N_i − µ*_i‖² ] ≤ max_i ‖µ*_i‖² / N^{Ω(1)} + O(k³) ( exp(−C²/8)(C² + k)σ² + (dσ² log N)/N ).

Note that our streaming update analysis requires only C = Ω(√(log k)) separation, but it needs an appropriate initialization, which is guaranteed by the result below.
Theorem 3 (Initialization). Let x^t, 1 ≤ t ≤ N0 be generated using a mixture of Gaussians (1) with w_i = 1/k, ∀i. Let µ^0_1, µ^0_2, . . . , µ^0_k be the output of Algorithm 1. If C = Ω((k log k)^{1/4}) and N0 = Ω(d³k³ log dk), then w.p. ≥ 1 − 1/poly(k), we have max_i ‖µ^0_i − µ*_i‖ ≤ (C/20)σ.

4 Streaming Update Analysis

At a high level, our analysis shows that at each step of the streaming updates, the error decreases on average. However, due to the non-convexity of the objective function, we can show such a decrease only if the current estimates of our centers lie in a small ball around the true centers of the Gaussians. Indeed, while the initialization provides us with such centers, due to the added noise in each step, our iterates may occasionally fall outside these balls, and we need to bound the probability that this happens. 
To overcome this, we start with initial centers that are within slightly smaller balls around the true means, and use a careful martingale argument to show that even if the iterates go a little farther from the true centers (due to noise), with high probability the iterates are still within the slightly larger balls that we require to show a decrease in error.
We therefore divide our proof into two parts: a) first, we show in Section 4.1 that the error decreases in expectation, assuming that the current estimates lie in a reasonable neighborhood around the true centers; and b) in Section 4.2, we show using a martingale analysis that with high probability, each iterate satisfies the required neighborhood condition if the initialization is good enough.
We formalize the required condition for our per-iteration error analysis below. For the remainder of this section, we fix the initialization and only focus on Steps 3-8 of Algorithm 2.
Definition 1. For a fixed initialization, and given a sequence of points ω_t = (x^{t′+N0+1} : 0 ≤ t′ < t), we say that condition I_t is satisfied at time t if max_i ‖µ^{t′}_i − µ*_i‖ ≤ Cσ/10 holds for all 0 ≤ t′ ≤ t. Note that given a sequence of points and a fixed initialization, Algorithm 2 is deterministic.
We now define the following quantities, which will be useful in the upcoming analysis. At any time t ≥ 1, let ω_t = (x^{t′+N0+1} : 0 ≤ t′ < t) denote the sequence of points received by our algorithm. For all t ≥ 0, let Ẽ^i_t = ‖µ^t_i − µ*_i‖² denote the random variable measuring the current error for cluster i, and let Ṽ_t = max_i Ẽ^i_t be the maximum cluster error at time t. Now, let Ê^i_{t+1} = E_{x^{t+N0+1}}[ ‖µ^{t+1}_i − µ*_i‖² | ω_t ] be the expected error of the i-th cluster center after receiving the (t+1)-th sample, conditioned on ω_t. Finally, let E^i_t = E[ ‖µ^t_i − µ*_i‖² | I_t ] be the expected error conditioned on I_t, and let E_t = Σ_i E^i_t.

4.1 Error Reduction in a Single Iteration

Our main tool toward showing Theorem 2 is the following theorem, which bounds the expected error after updating the means on arrival of the next sample.
Theorem 4. If I_t holds and C ≥ Ω(√(log k)), then for all i, we have

    Ê^i_{t+1} ≤ (1 − η/2k) Ẽ^i_t + (η/k⁵) Ṽ_t + O(1)η²dσ² + O(k)η(1 − η) exp(−C²/8)(C² + k)σ².

Proof sketch of Theorem 4. In all calculations in this proof, we first assume that the candidate centers satisfy I_t, and all expectations and probabilities are only over the new sample x^{t+N0+1}, which we denote by x after omitting the superscript. Now recall our update rule: µ^{t+1}_i = (1 − η)µ^t_i + ηx if µ^t_i is the closest center to the new sample x; the other centers are unchanged. To simplify notation, let:

    g^t_i(x) = 1 iff i = arg min_j ‖x − µ^t_j‖,   g^t_i(x) = 0 otherwise.    (2)

By definition, we have for all i,

    µ^{t+1}_i = (1 − η)µ^t_i + η ( g^t_i(x) x + (1 − g^t_i(x)) µ^t_i ) = µ^t_i + η g^t_i(x)(x − µ^t_i).

Our proof relies on the following simple yet crucial lemmas. The first bounds the failure probability of a sample being closest to an incorrect cluster center among our candidates. The second shows that if the candidate centers are sufficiently close to the true centers, then the failure probability of mis-classifying a point to a wrong center is (up to constant factors) the probability of mis-classification even in the optimal solution (with the true centers). Finally, the third lemma shows that the probability of g^t_i(x) = 1 for each i is lower-bounded. Complete details and proofs appear in [14].
Lemma 1. Suppose condition I_t holds. For any i, j ≠ i, let x ∼ Cl(j) denote a random point from cluster j. Then Pr[ ‖x − µ^t_i‖ ≤ ‖x − µ^t_j‖ ] ≤ exp(−Ω(C²_ij)).
Lemma 2. Suppose max(‖µ^t_i − µ*_i‖, ‖µ^t_j − µ*_j‖) ≤ σ/C_ij. For any i, j ≠ i, let x ∼ Cl(j) denote a random point from cluster j. Then Pr[ ‖x − µ^t_i‖ ≤ ‖x − µ^t_j‖ ] ≤ O(1) exp(−C²_ij/8).
Lemma 3. If I_t holds and C = Ω(√(log k)), then for all i, Pr[ g^t_i(x) = 1 ] ≥ 1/2k.
And so, equipped with the above notation and lemmas, we have

    Ê^i_{t+1} = E_x[ ‖µ^{t+1}_i − µ*_i‖² ]
      = (1 − η)² ‖µ^t_i − µ*_i‖² + η² E[ ‖g^t_i(x)(x − µ*_i) + (1 − g^t_i(x))(µ^t_i − µ*_i)‖² ]
          + 2η(1 − η) E[ ⟨ µ^t_i − µ*_i, g^t_i(x)(x − µ*_i) + (1 − g^t_i(x))(µ^t_i − µ*_i) ⟩ ]
      ≤ (1 − η/2k) Ẽ^i_t + η² E[ ‖g^t_i(x)(x − µ*_i)‖² ]  (=: T1)  + 2η(1 − η) E[ ⟨ µ^t_i − µ*_i, g^t_i(x)(x − µ*_i) ⟩ ]  (=: T2).

The last inequality holds because of the following line of reasoning: (i) firstly, the cross term in the second squared norm evaluates to 0 due to the product g^t_i(x)(1 − g^t_i(x)); (ii) η² E[ (1 − g^t_i(x)) ‖µ^t_i − µ*_i‖² ] ≤ η² Ẽ^i_t; (iii) 2η(1 − η) E[ ⟨ µ^t_i − µ*_i, (1 − g^t_i(x))(µ^t_i − µ*_i) ⟩ ] ≤ 2η(1 − η) Ẽ^i_t Pr[ g^t_i(x) = 0 ] ≤ 2η(1 − η) Ẽ^i_t (1 − 1/2k) by Lemma 3; and finally (iv) by collecting the terms with coefficient Ẽ^i_t.
The proof then roughly proceeds as follows: suppose in an ideal case, g^t_i(x) is 1 for all points x generated from cluster i, and 0 otherwise. Then, if x is a random sample from cluster i, the expectation in T1 would be dσ², and T2 would be 0. Of course, the difficulty is that g^t_i(x) is not always as well-behaved, and so the bulk of the analysis is in carefully using Lemmas 1 and 2, and appropriately "charging" the various error terms we get to the current error Ẽ^i_t, the variance, and the offline approximation error.

4.2 Ensuring the Proximity Condition Via Super-Martingales

In the previous section, we saw that condition I_t = 1 is sufficient to ensure that the expected one-step error reduces at time step t + 1. Our next result shows that I_N = 1 is satisfied with high probability.
Theorem 5. Suppose max_i ‖µ^0_i − µ*_i‖ ≤ (C/20)σ. Then I_N = 1 w.p. ≥ 1 − 1/poly(N).
Our argument proceeds as follows. Suppose we track the behaviour of the actual error terms Ẽ^i_t over time, and stop the process (call it a failure) when any of these error terms exceeds C²σ²/100 (recall that they are all initially smaller than C²σ²/400). Assuming that the process has not stopped, we show that each of these error terms has a super-martingale behaviour using Theorem 4, which says that on average, the expected one-step error drops. Moreover, we also show that the actual one-step difference, while not bounded, has a sub-Gaussian tail. Our theorem now follows by using an Azuma-Hoeffding type inequality for super-martingale sequences.

4.3 Wrapping Up

Now, using Theorems 4 and 5, we can get the following theorem.
Theorem 6. Let γ = O(k)η²dσ² + O(k²)η(1 − η) exp(−C²/8)(C² + k)σ². Then if C ≥ Ω(√(log k)), for all t, we have E_{t+1} ≤ (1 − η/4k) E_t + γ; consequently, E_N ≤ (1 − η/4k)^N E_0 + (4k/η)γ.

Proof. Let Ē^i_{t+1} = E[ ‖µ^{t+1}_i − µ*_i‖² | I_t ] be the average over all sample paths of Ẽ^i_{t+1} conditioned on I_t. Recall that E_{t+1} is very similar, except that the conditioning is on I_{t+1}. With this notation, let us take the expectation over all sample paths where I_t is satisfied, and use Theorem 4 to get

    Ē^i_{t+1} ≤ (1 − η/2k) E^i_t + (η/k⁵) E_t + O(1)η²dσ² + O(k)η(1 − η) exp(−C²/8)(C² + k)σ².

And so, summing over all i, we get

    Ē_{t+1} ≤ (1 − η/3k) E_t + O(k)η²dσ² + O(k²)η(1 − η) exp(−C²/8)(C² + k)σ².

Finally, note that Ē_{t+1} and E_{t+1} are related as E_{t+1} Pr[I_{t+1}] ≤ Ē_{t+1} Pr[I_t], and so E_{t+1} ≤ Ē_{t+1}(1 + 1/N²), since Pr[I_{t+1}] ≥ 1 − 1/N⁵ by Theorem 5. This gives the recursion E_{t+1} ≤ (1 − η/4k)E_t + γ; unrolling it over N steps and bounding the resulting geometric sum Σ_{t<N} (1 − η/4k)^t by 4k/η yields the second claim.

Proof of Theorem 2. From Theorem 5, we know that I_N is satisfied with probability at least 1 − 1/N⁵, and in this case, we can use Theorem 6 to get the desired error bound. In case I_N fails, the maximum possible error is roughly max_{i,j} ‖µ*_i − µ*_j‖² · N (when all our samples are sent to the same cluster), which contributes a negligible amount to the bias term.

5 Initialization for streaming k-means

In Section 4, we saw that our proposed streaming algorithm can lead to a good solution for any separation Cσ ≥ O(√(log k))σ if we can initialize all centers such that ‖µ^0_i − µ*_i‖ ≤ (C/20)σ. We now show that InitAlg (Algorithm 1) is one such procedure. We first approximately compute the top-k eigenvectors U of the data covariance using a streaming PCA algorithm [9, 13] on O(k³d³ log d) samples. We next store k log k points and project them onto the subspace spanned by U. We then
We then perform a simple distance-based clustering [18] that correctly clusters the stored points (assuming reasonable center separation), and finally we output these cluster centers.

Proof of Theorem 3. Using an argument similar to [9] (Theorem 3), we get that the $U$ obtained by the online PCA algorithm (Steps 1-4 of Algorithm 1) satisfies (w.p. $\ge 1 - 1/\mathrm{poly}(d)$):
$$\|UU^T\mu^\star_i - \mu^\star_i\|^2 \le .01\sigma^2, \quad \forall\, 1 \le i \le k. \qquad (3)$$
Now, let $\hat{\mu}^*_i = U^T\mu^\star_i$. For any $x$ sampled from the mixture distribution (1), $U^Tx \sim \sum_i w_i \mathcal{N}(\hat{\mu}^*_i, \sigma^2 I)$. Hence, if $U^Tx^t$ and $U^Tx^{t'}$ both belong to cluster $i$, i.e., $x^t = \mu^\star_i + z^t$ and $x^{t'} = \mu^\star_i + z^{t'}$, then (w.p. $\ge 1 - 1/k^\alpha$):
$$\|U^Tx^t - U^Tx^{t'}\|^2 = \|U^T(z^t - z^{t'})\|^2 \le \big(k + 8\alpha\sqrt{k\log k}\big)\sigma^2, \qquad (4)$$
where the last inequality follows from a standard $\chi^2$ random variable tail bound. Similarly, if $U^Tx^t$ and $U^Tx^{t'}$ belong to clusters $i$ and $j$, i.e., $x^t = \mu^\star_i + z^t$ and $x^{t'} = \mu^\star_j + z^{t'}$, then (w.p. $\ge 1 - 1/k^\alpha$):
$$\|U^Tx^t - U^Tx^{t'}\|^2 = \|\hat{\mu}^*_i - \hat{\mu}^*_j\|^2 + \|U^T(z^t - z^{t'})\|^2 + 2(\hat{\mu}^*_i - \hat{\mu}^*_j)^T U^T(z^t - z^{t'}) \ge \big(C^2 - .2C + 8\alpha\sqrt{k\log k} - 16\alpha C\sqrt{\log k}\big)\sigma^2, \qquad (5)$$
where the above follows by using (3), setting $\alpha = C/32$, and using $C = \Omega((k\log k)^{1/4})$.

Using (4) and (5), w.h.p. all the points from the same cluster are closer to each other than points from other clusters. Hence, the connected components of the nearest-neighbor graph recover the clusters accurately.

Now, we estimate $\hat{\mu}_i = \frac{1}{|\mathrm{Cluster}(i)|}\sum_{t \in \mathrm{Cluster}(i)} U^Tx^t$ for each $i$. Since our clustering is completely accurate, we have w.p. $\ge 1 - 2m^2/k^{C/32}$:
$$\|\hat{\mu}_i - \hat{\mu}^*_i\|_2 \le \frac{\sigma\sqrt{\log k}}{\sqrt{|\mathrm{Cluster}(i)|}}. \qquad (6)$$
As $w_i = 1/k$ for all $i$, we have $|\mathrm{Cluster}(i)| \ge \frac{m}{k} - C\sqrt{\frac{m}{k}}$ w.p. $\ge 1 - 1/k^{C/32}$. The theorem now follows by setting $m = O(k\log k)$ and by using (3) and (6) along with $C = \Omega((k\log k)^{1/4})$.

Remark 1. We would like to emphasize that our analysis of the convergence of the streaming algorithms works even for smaller separations $C = O(\sqrt{\log k})$, as long as we can get a good enough initialization. Hence, a better initialization algorithm with a weaker dependence of $C$ on $k$ would lead to an improvement in the overall algorithm.

6 Soft thresholding EM based algorithm

In this section, we study a streaming version of the Expectation Maximization (EM) algorithm [7], which is also used extensively in practice. While the standard k-means or Lloyd's heuristic is known to be agnostic to the distribution, in that the same procedure can solve the mixture problem for a variety of distributions [12], EM algorithms are designed specifically for the input mixture distribution. Here, we consider a streaming version of the EM algorithm applied to the problem of a mixture of two spherical Gaussians with known variances.
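The streaming E- and M-steps analyzed in this section (Algorithm 3 below) are simple to implement. The following is a minimal Python sketch: the synthetic stream, toy dimensions, and hand-picked initial estimate (standing in for InitAlg) are illustrative assumptions, and the tanh expression is an algebraic rewrite of the E-step weight $w^t$.

```python
import numpy as np

def stream_soft_update(stream, mu0, sigma, eta):
    """Streaming soft-thresholding update for a balanced mixture
    (1/2) N(mu*, sigma^2 I) + (1/2) N(-mu*, sigma^2 I).
    E-step: w = posterior probability that x came from the +mu cluster.
    M-step: mu <- (1 - eta) mu + eta (2w - 1) x.
    Since ||x - mu||^2 - ||x + mu||^2 = -4 <x, mu>, the soft coefficient
    2w - 1 equals tanh(2 <x, mu> / sigma^2), a numerically stable form."""
    mu = np.array(mu0, dtype=float)
    for x in stream:
        coef = np.tanh(2.0 * (x @ mu) / sigma ** 2)  # 2 w^t - 1
        mu = (1.0 - eta) * mu + eta * coef * x       # convex-combination M-step
    return mu

# Toy run: d = 10, mu* = 3 e_1 (so C = 2 ||mu*|| / sigma = 6), and a crude
# initial estimate standing in for InitAlg.
rng = np.random.default_rng(0)
d, sigma, N = 10, 1.0, 20000
mu_star = np.zeros(d)
mu_star[0] = 3.0
signs = rng.choice([-1.0, 1.0], size=N)
X = signs[:, None] * mu_star + sigma * rng.standard_normal((N, d))
eta = 3 * np.log(N) / N                    # step size from Algorithm 3
mu_hat = stream_soft_update(iter(X), mu_star + 0.1, sigma, eta)
# The mixture is symmetric, so mu_hat may converge to either +mu* or -mu*.
err = min(np.linalg.norm(mu_hat - mu_star), np.linalg.norm(mu_hat + mu_star))
```

With this step size the bias term $(1-\eta)^N \approx N^{-3}$ is driven to zero, and the residual error is dominated by the variance term, consistent with Theorem 7 below.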
In this case, the EM algorithm reduces to a softer version of Lloyd's algorithm, in which a point can be partially assigned to the two clusters. Recent results by [6, 3, 19] show convergence of the EM algorithm in the offline setting for this simple setup. In keeping with earlier notation, let $\mu^\star_1 = \mu^\star$ and $\mu^\star_2 = -\mu^\star$, and let the center separation be $C = \frac{2\|\mu^\star\|}{\sigma}$. Hence, $x^t \overset{\text{i.i.d.}}{\sim} \frac{1}{2}\mathcal{N}(\mu^\star, \sigma^2 I) + \frac{1}{2}\mathcal{N}(-\mu^\star, \sigma^2 I)$.

Algorithm 3 StreamSoftUpdate$(N, N_0)$
1: Set $\eta = \frac{3\log N}{N}$.
2: Set $\mu^0 \leftarrow \text{InitAlg}(N_0)$.
3: for $t = 1$ to $N$ do
4: Receive $x^{t+N_0}$ as generated by the input stream.
5: $x = x^{t+N_0}$
6: Let $w^t = \dfrac{\exp\left(\frac{-\|x-\mu^t\|^2}{\sigma^2}\right)}{\exp\left(\frac{-\|x-\mu^t\|^2}{\sigma^2}\right) + \exp\left(\frac{-\|x+\mu^t\|^2}{\sigma^2}\right)}$
7: Set $\mu^{t+1} = (1-\eta)\mu^t + \eta\,[2w^t - 1]\,x$.
8: end for

In our algorithm, $w^t(x)$ is an estimate of the probability that $x$ belongs to the cluster with center $\mu^t$, given that it is drawn from a balanced mixture of Gaussians at $\mu^t$ and $-\mu^t$. Calculating $w^t(x)$ corresponds to the E-step, and updating the estimate of the centers corresponds to the M-step of the EM algorithm. As with the streaming Lloyd's algorithm presented in Section 3, our analysis of the streaming soft updates can be separated into the streaming update analysis and the analysis of InitAlg (which is already presented in Section 5). We now provide our main theorem; the proof is presented in Appendix C.

Theorem 7 (Streaming Update). Let $x^t$, $1 \le t \le N + N_0$, be generated by a mixture of two balanced spherical Gaussians with variance $\sigma^2$. Also, let the center separation satisfy $C \ge 4$, and suppose our initial estimate $\mu^0$ is such that $\|\mu^0 - \mu^\star\| \le \frac{C\sigma}{20}$. Then the streaming update of StreamSoftUpdate$(N, N_0)$, i.e., Steps 3-8 of Algorithm 3, satisfies:
$$\mathbb{E}\big[\|\mu^N - \mu^\star\|^2\big] \le \underbrace{\frac{\|\mu^\star\|^2}{N^{\Omega(1)}}}_{\text{bias}} + \underbrace{O(1)\,\frac{d\sigma^2 \log N}{N}}_{\text{variance}} .$$

Remark 2. Our bias and variance terms are similar to the ones in Theorem 1, but the above bound does not have the additional approximation error term. Hence, in this case we can estimate $\mu^\star$ consistently, although the algorithm applies only to a mixture of Gaussians, while our algorithm and result in Section 3 can potentially be applied to arbitrary sub-Gaussian distributions.

Remark 3. We note that for our streaming soft update algorithm, it is not critical to know the variance $\sigma^2$ beforehand. One could get a good estimate of $\sigma$ by taking the mean of a random projection of a small number of points. We provide the details in the full version of this paper [14].

7 Conclusions

In this paper, we studied the problem of clustering streaming data where each data point is sampled from a mixture of spherical Gaussians. For this problem, we studied two algorithms that use appropriate initialization: a) a streaming version of Lloyd's method, and b) a streaming EM method. For both methods, we showed that we can accurately initialize the cluster centers using an online PCA based method.
We then showed that, assuming $\Omega((k\log k)^{1/4}\sigma)$ separation between the cluster centers, the updates of both methods decrease both the bias and the variance error terms. For Lloyd's method there is an additional approximation error term, which even the offline algorithm incurs, and which is avoided by the EM method. However, the streaming Lloyd's method is agnostic to the data distribution and can in fact be applied to any mixture of sub-Gaussians. For future work, it would be interesting to study the streaming data clustering problem under deterministic assumptions like [12, 16]. It is also an important question to understand the optimal separation assumptions needed even for the offline Gaussian mixture clustering problem.

References

[1] Anima Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models (a survey for ALT). In Proceedings of ALT, pages 19–38, 2015.

[2] Hassan Ashtiani, Shai Ben-David, and Abbas Mehrabian. Sample-efficient learning of mixtures. arXiv preprint arXiv:1706.01596, 2017.

[3] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45(1):77–120, 2017.

[4] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Clustering under approximation stability. J. ACM, 60(2):8:1–8:34, 2013.

[5] Anirban Dasgupta, John Hopcroft, Ravi Kannan, and Pradipta Mitra. Spectral clustering with limited independence. In Proceedings of SODA, pages 1036–1045, 2007.

[6] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. arXiv preprint arXiv:1609.00368, 2016.

[7] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, pages 1–38, 1977.

[8] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2000.

[9] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Proceedings of NIPS, pages 2861–2869, 2014.

[10] Daniel J. Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of ITCS, pages 11–20, 2013.

[11] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Proceedings of COLT, pages 1147–1164, 2016.

[12] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Proceedings of FOCS, pages 299–308, 2010.

[13] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In Proceedings of NIPS, pages 2886–2894, 2013.

[14] Aditi Raghunathan, Ravishankar Krishnaswamy, and Prateek Jain. Learning mixture of Gaussians with streaming data. CoRR, abs/1707.02391, 2017.

[15] Ohad Shamir. A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392, 2011.

[16] Cheng Tang and Claire Monteleoni. On Lloyd's algorithm: New theoretical insights for clustering in practice. In Proceedings of AISTATS, pages 1280–1289, 2016.

[17] Cheng Tang and Claire Monteleoni. Convergence rate of stochastic k-means. In Proceedings of AISTATS, 2017.

[18] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. J. Comput. Syst. Sci., 68(4):841–860, 2004.

[19] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems, pages 2676–2684, 2016.