{"title": "Learning Manifolds with K-Means and K-Flats", "book": "Advances in Neural Information Processing Systems", "page_first": 2465, "page_last": 2473, "abstract": "We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-\ufb02ats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-\ufb02ats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-\ufb02ats, both the results and the mathematical tools are new.", "full_text": "Learning Manifolds with K-Means and K-Flats\n\nGuillermo D. Canas?,\u2020\n\n? Laboratory for Computational and Statistical Learning - MIT-IIT\n\u2020 CBCL, McGovern Institute - Massachusetts Institute of Technology\nlrosasco@mit.edu\nguilledc@mit.edu\n\ntp@ai.mit.edu\n\nTomaso Poggio?,\u2020\n\nLorenzo A. Rosasco?,\u2020\n\nAbstract\n\nWe study the problem of estimating a manifold from random samples. In partic-\nular, we consider piecewise constant and piecewise linear estimators induced by\nk-means and k-\ufb02ats, and analyze their performance. We extend previous results\nfor k-means in two separate directions. First, we provide new results for k-means\nreconstruction on manifolds and, secondly, we prove reconstruction bounds for\nhigher-order approximation (k-\ufb02ats), for which no known results were previously\navailable. While the results for k-means are novel, some of the technical tools are\nwell-established in the literature. In the case of k-\ufb02ats, both the results and the\nmathematical tools are new.\n\n1 Introduction\n\nOur study is broadly motivated by questions in high-dimensional learning. 
As is well known, learning in high dimensions is feasible only if the data distribution satisfies suitable prior assumptions. One such assumption is that the data distribution lies on, or is close to, a low-dimensional set embedded in a high-dimensional space, for instance a low-dimensional manifold. This latter assumption has proved to be useful in practice, as well as amenable to theoretical analysis, and it has led to a significant amount of recent work. Starting from [23, 34, 5], this set of ideas, broadly referred to as manifold learning, has been applied to a variety of problems from supervised [35] and semi-supervised learning [6], to clustering [37] and dimensionality reduction [5], to name a few.
Interestingly, the problem of learning the manifold itself has received less attention: given samples from a d-manifold M embedded in some ambient space X, the problem is to learn a set that approximates M in a suitable sense. This problem has been considered in computational geometry, but in a setting in which typically the manifold is a hyper-surface in a low-dimensional space (e.g. R³), and the data are typically not sampled probabilistically; see for instance [26, 24]. The problem of learning a manifold is also related to that of estimating the support of a distribution (see [13, 14] for recent surveys). In this context, some of the distances considered to measure approximation quality are the Hausdorff distance and the so-called excess-mass distance.
The reconstruction framework that we consider is related to the work of [1, 32], as well as to the framework proposed in [30], in which a manifold is approximated by a set, with performance measured by an expected distance to this set.
This setting is similar to the problem of dictionary learning (see for instance [29], and extensive references therein), in which a dictionary is found by minimizing a similar reconstruction error, perhaps with additional constraints on an associated encoding of the data. Crucially, while the dictionary is learned on the empirical data, the quantity of interest is the expected reconstruction error, which is the focus of this work.
We analyze this problem by focusing on two important, and widely-used algorithms, namely k-means and k-flats. The k-means algorithm can be seen to define a piecewise constant approximation of M. Indeed, it induces a Voronoi decomposition on M, in which each Voronoi region is effectively approximated by a fixed mean. Given this, a natural extension is to consider higher-order approximations, such as those induced by discrete collections of k d-dimensional affine spaces (k-flats), with possibly better resulting performance. Since M is a d-manifold, the k-flats approximation naturally resembles the way in which a manifold is locally approximated by its tangent bundle.
Our analysis extends previous results for k-means to the case in which the data-generating distribution is supported on a manifold, and provides analogous results for k-flats. We note that the k-means algorithm has been widely studied, and thus much of our analysis in this case involves the combination of known facts to obtain novel results. The analysis of k-flats, however, requires developing substantially new mathematical tools.
The rest of the paper is organized as follows. In section 2, we describe the formal setting and the algorithms that we study. We begin our analysis by discussing the reconstruction properties of k-means in section 3.
In section 4, we present and discuss our main results, whose proofs are postponed to the appendices.

2 Learning Manifolds

Let X be a Hilbert space with inner product ⟨·,·⟩, endowed with a Borel probability measure ρ supported over a compact, smooth d-manifold M. We assume the data to be given by a training set, in the form of samples Xn = (x1, . . . , xn) drawn identically and independently with respect to ρ.
Our goal is to learn a set Sn that approximates well the manifold. The approximation (learning error) is measured by the expected reconstruction error

    Eρ(Sn) := ∫_M d²_X(x, Sn) dρ(x),    (1)

where the distance to a set S ⊆ X is d²_X(x, S) = inf_{x′∈S} d²_X(x, x′), with d_X(x, x′) = ‖x − x′‖. This is the same reconstruction measure that has been the recent focus of [30, 4, 32].
It is easy to see that any set such that S ⊃ M will have zero risk, with M being the “smallest” such set (with respect to set containment). In other words, the above error measure does not introduce an explicit penalty on the “size” of Sn: enlarging any given Sn can never increase the learning error. With this observation in mind, we study specific learning algorithms that, given the data, produce a set belonging to some restricted hypothesis space H (e.g. sets of size k for k-means), which effectively introduces a constraint on the size of the sets. Finally, note that the risk of Equation 1 is non-negative and, if the hypothesis space is sufficiently rich, the risk of an unsupervised algorithm may converge to zero under suitable conditions.

2.1 Using K-Means and K-Flats for Piecewise Manifold Approximation

In this work, we focus on two specific algorithms, namely k-means [28, 27] and k-flats [9]. Although typically discussed in the Euclidean space case, their definition can be easily extended to a Hilbert space setting.
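The reconstruction measure of Equation (1), and the fact that enlarging a set can never increase it, can be illustrated with a small Monte Carlo sketch (our own illustration, not from the paper; the circle data and all names are hypothetical):

```python
import numpy as np

def dist2_to_set(X, S):
    """Squared distance from each row of X to its closest element of the
    finite set S (rows of S): d^2(x, S) = min_{x' in S} ||x - x'||^2."""
    D2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return D2.min(axis=1)

def reconstruction_error(X, S):
    """Empirical (Monte Carlo) estimate of the expected reconstruction
    error E_rho(S), using samples X drawn from rho."""
    return dist2_to_set(X, S).mean()

rng = np.random.default_rng(0)
# toy "manifold": the unit circle (a 1-manifold) embedded in R^2
theta = rng.uniform(0, 2 * np.pi, size=5000)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)

S = rng.standard_normal((4, 2))                          # arbitrary candidate set
S_bigger = np.vstack([S, rng.standard_normal((4, 2))])   # a superset of S

e_small = reconstruction_error(X, S)
e_big = reconstruction_error(X, S_bigger)
assert e_big <= e_small  # enlarging a set can never increase the error
```

Since the distance is a minimum over the set, adding elements can only decrease (or leave unchanged) every per-sample distance, hence also their average.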
The study of manifolds embedded in a Hilbert space is of special interest when considering non-linear (kernel) versions of the algorithms [15]. More generally, this setting can be seen as a limit case when dealing with high dimensional data. Naturally, the more classical setting of an absolutely continuous distribution over d-dimensional Euclidean space is simply a particular case, in which X = Rᵈ, and M is a domain with positive Lebesgue measure.
K-Means. Let H = S_k be the class of sets of size k in X. Given a training set Xn and a choice of k, k-means is defined by the minimization over S ∈ S_k of the empirical reconstruction error

    E_n(S) := (1/n) Σ_{i=1}^{n} d²_X(x_i, S),    (2)

where, for any fixed set S, E_n(S) is an unbiased empirical estimate of Eρ(S), so that k-means can be seen to be performing a kind of empirical risk minimization [10, 7, 30, 8, 31].
A minimizer of Equation 2 on S_k is a discrete set of k means S_{n,k} = {m1, . . . , mk}, which induces a Dirichlet-Voronoi tiling of X: a collection of k regions, each closest to a common mean [3] (in our notation, the subscript n denotes the dependence of S_{n,k} on the sample, while k refers to its size). By virtue of S_{n,k} being a minimizing set, each mean must occupy the center of mass of the samples in its Voronoi region. These two facts imply that it is possible to compute a local minimum of the empirical risk by using a greedy coordinate-descent relaxation, namely Lloyd’s algorithm [27]. Furthermore, given a finite sample Xn, the number of locally-minimizing sets S_{n,k} is also finite, since (by the center-of-mass condition) there cannot be more than the number of possible partitions of Xn into k groups, and therefore the global minimum must be attainable.
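Lloyd's algorithm described above can be sketched in a few lines (a simplified illustration with naive random seeding rather than k-means++; the function name and defaults are our own):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """A minimal sketch of Lloyd's algorithm: alternate between assigning
    each sample to its closest mean (its Voronoi region) and moving each
    mean to the center of mass of its region. Returns the means and the
    empirical reconstruction error E_n of Equation (2)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()  # naive seeding
    for _ in range(n_iter):
        # assignment step: Voronoi region of each sample
        D2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        labels = D2.argmin(axis=1)
        # update step: center of mass of each non-empty region
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    # empirical reconstruction error of the final means
    D2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return means, D2.min(axis=1).mean()
```

For k = 1 the procedure returns the global mean, whose error is the average squared distance to it; for larger k, the center-of-mass update guarantees the returned error never exceeds the k = 1 error.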
Even though Lloyd’s algorithm provides no guarantees of closeness to the global minimizer, in practice it is possible to use a randomized approximation algorithm, such as k-means++ [2], which provides guarantees of approximation to the global minimum in expectation with respect to the randomization.
K-Flats. Let H = F_k be the class of collections of k flats (affine spaces) of dimension d. For any value of k, k-flats, analogously to k-means, aims at finding the set F_k ∈ F_k that minimizes the empirical reconstruction (2) over F_k. By an argument similar to the one used for k-means, a global minimizer must be attainable, and a Lloyd-type relaxation converges to a local minimum. Note that, in this case, given a Voronoi partition of M into regions closest to each d-flat, new optimizing flats for that partition can be computed by a d-truncated PCA solution on the samples falling in each region.

2.2 Learning a Manifold with K-means and K-flats

In practice, k-means is often interpreted to be a clustering algorithm, with clusters defined by the Voronoi diagram of the set of means S_{n,k}. In this interpretation, Equation 2 is simply rewritten by summing over the Voronoi regions, and adding all pairwise distances between samples in the region (the intra-cluster distances). For instance, this point of view is considered in [11], where k-means is studied from an information-theoretic perspective. K-means can also be interpreted to be performing vector quantization, where the goal is to minimize the encoding error associated to a nearest-neighbor quantizer [17].
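The Lloyd-type relaxation for k-flats, alternating nearest-flat assignment with a per-region d-truncated PCA refit, can be sketched as follows (our own simplified illustration; the random initial partition and fixed iteration count are naive placeholders, not the paper's procedure):

```python
import numpy as np

def dist2_to_flat(X, c, B):
    """Squared distance from rows of X to the affine d-flat {c + B t},
    where B is an orthonormal (D, d) basis of the flat's directions."""
    R = X - c                      # residuals w.r.t. the flat's offset
    proj = R @ B                   # coordinates within the flat
    return (R ** 2).sum(axis=1) - (proj ** 2).sum(axis=1)

def fit_flat(X, d):
    """Best d-flat for a sample: mean offset plus top-d principal
    directions (the d-truncated PCA solution)."""
    c = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - c, full_matrices=False)
    return c, Vt[:d].T

def kflats(X, k, d, n_iter=30, seed=0):
    """Lloyd-type relaxation for k-flats: assign each sample to its
    closest flat, then refit each flat by d-truncated PCA."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))   # random initial partition
    flats = [fit_flat(X[labels == j], d) for j in range(k)]
    for _ in range(n_iter):
        D2 = np.stack([dist2_to_flat(X, c, B) for c, B in flats], axis=1)
        labels = D2.argmin(axis=1)
        flats = [fit_flat(X[labels == j], d) if np.any(labels == j)
                 else flats[j] for j in range(k)]
    D2 = np.stack([dist2_to_flat(X, c, B) for c, B in flats], axis=1)
    return flats, D2.min(axis=1).mean()        # flats and E_n(F_{n,k})
```

On data lying exactly on a single line, for instance, any refit flat containing at least two of its points recovers the line, and the relaxation drives the empirical reconstruction error to (numerically) zero.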
Interestingly, in the limit of increasing sample size, this problem coincides, in a precise sense [33], with the problem of optimal quantization of probability distributions (see for instance the excellent monograph [18]).
When the data-generating distribution is supported on a manifold M, k-means can be seen to be approximating points on the manifold by a discrete set of means. Analogously to the Euclidean setting, this induces a Voronoi decomposition of M, in which each Voronoi region is effectively approximated by a fixed mean (in this sense k-means produces a piecewise constant approximation of M). As in the Euclidean setting, the limit of this problem with increasing sample size is precisely the problem of optimal quantization of distributions on manifolds, which is the subject of significant recent work in the field of optimal quantization [20, 21].
In this paper, we take the above view of k-means as defining a (piecewise constant) approximation of the manifold M supporting the data distribution. In particular, we are interested in the behavior of the expected reconstruction error Eρ(S_{n,k}), for varying k and n. This perspective has an interesting relation with dictionary learning, in which one is interested in finding a dictionary, and an associated representation, that allow one to approximately reconstruct a finite set of data points/signals. In this interpretation, the set of means can be seen as a dictionary of size k that produces a maximally sparse representation (the k-means encoding); see for example [29] and references therein.
Crucially, while the dictionary is learned on the available empirical data, the quantity of interest is the expected reconstruction error, and the question of characterizing the performance with respect to this latter quantity naturally arises.
Since k-means produces a piecewise constant approximation of the data, a natural idea is to consider higher orders of approximation, such as approximation by discrete collections of k d-dimensional affine spaces (k-flats), with possibly better performance. Since M is a d-manifold, the approximation induced by k-flats may more naturally resemble the way in which a manifold is locally approximated by its tangent bundle. We provide in Sec. 4.2 a partial answer to this question.

3 Reconstruction Properties of k-Means

Since we are interested in the behavior of the expected reconstruction (1) of k-means and k-flats for varying k and n, before analyzing this behavior, we consider what is currently known about this problem, based on previous work. While k-flats is a relatively new algorithm whose behavior is not yet well understood, several properties of k-means are currently known.

Figure 1 (left panel: Sphere Dataset; right panel: MNIST Dataset): We consider the behavior of k-means for data sets obtained by sampling uniformly a 19-dimensional sphere embedded in R²⁰ (left). For each value of k, k-means (with k-means++ seeding) is run 20 times, and the best solution kept. The reconstruction performance on a (large) hold-out set is reported as a function of k. The results for four different training set cardinalities are reported: for a small number of points, the reconstruction error decreases sharply for small k and then increases, while it is simply decreasing for larger data sets. A similar experiment, yielding similar results, is performed on subsets of the MNIST (http://yann.lecun.com/exdb/mnist) database (right).
In this case the data might be thought to be concentrated around a low-dimensional manifold. For example, [22] report an average intrinsic dimension d for each digit to be between 10 and 13.

Recall that k-means finds a discrete set S_{n,k} of size k that best approximates the samples in the sense of (2). Clearly, as k increases, the empirical reconstruction error E_n(S_{n,k}) cannot increase, and typically decreases. However, we are ultimately interested in the expected reconstruction error, and therefore would like to understand the behavior of Eρ(S_{n,k}) with varying k, n.
In the context of optimal quantization, the behavior of the expected reconstruction error Eρ has been considered for an approximating set S_k obtained by minimizing the expected reconstruction error itself over the hypothesis space H = S_k. The set S_k can thus be interpreted as the output of a population, or infinite sample, version of k-means. In this case, it is possible to show that Eρ(S_k) is a non-increasing function of k and, in fact, to derive explicit rates. For example, in the case X = Rᵈ, and under fairly general technical assumptions, it is possible to show that Eρ(S_k) = Θ(k^{−2/d}), where the constants depend on ρ and d [18].
In machine learning, the properties of k-means have been studied, for fixed k, by considering the excess reconstruction error Eρ(S_{n,k}) − Eρ(S_k). In particular, this quantity has been studied for X = Rᵈ, and shown to be, with high probability, of order √(kd/n), up to logarithmic factors [31]. The case where X is a Hilbert space has been considered in [30, 8], where an upper bound of order k/√n is proven to hold with high probability. The more general setting where X is a metric space has been studied in [7].
When analyzing the behavior of Eρ(S_{n,k}), and in the particular case that X = Rᵈ, the above results can be combined to obtain, with high probability, a bound of the form

    Eρ(S_{n,k}) ≤ |Eρ(S_{n,k}) − E_n(S_{n,k})| + E_n(S_{n,k}) − E_n(S_k) + |E_n(S_k) − Eρ(S_k)| + Eρ(S_k)
                ≤ C ( √(kd/n) + k^{−2/d} ),    (3)

up to logarithmic factors, where the constant C does not depend on k or n (a complete derivation is given in the Appendix). The above inequality suggests a somewhat surprising effect: the expected reconstruction properties of k-means may be described by a trade-off between a statistical error (of order √(kd/n)) and a geometric approximation error (of order k^{−2/d}).
The existence of such a trade-off between the approximation and the statistical errors may itself not be entirely obvious; see the discussion in [4]. For instance, in the k-means problem, it is intuitive that, as more means are inserted, the expected distance from a random sample to the means should decrease, and one might expect a similar behavior for the expected reconstruction error.

Figure 2: The optimal k-means solution (red) computed from n = 2 samples drawn uniformly on S¹⁰⁰ (blue). For (a) k = 1, the expected squared distance to a random point x ∈ S¹⁰⁰ is Eρ(S_{k=1}) ≈ 1.5, while for (b) k = 2, it is Eρ(S_{k=2}) ≈ 2.
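As a small sanity check on the k^{−2/d} approximation term, for the uniform distribution on [0, 1] (so d = 1) the optimal k-point quantizer places its means at the cell centers (2i − 1)/(2k), giving Eρ(S_k) = 1/(12k²), so doubling k divides the error by four. A short Monte Carlo computation (our own illustration, not from the paper) recovers this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)   # samples from rho = U[0, 1]

def quantizer_error(k):
    # optimal k-means for U[0,1]: means at the centers of k equal cells
    means = (2 * np.arange(1, k + 1) - 1) / (2 * k)
    d2 = (x[:, None] - means[None, :]) ** 2
    return d2.min(axis=1).mean()

e4, e8 = quantizer_error(4), quantizer_error(8)
# E_rho(S_k) = 1/(12 k^2): doubling k divides the error by 4 (k^{-2/d}, d = 1)
assert abs(e4 - 1 / 192) < 1e-4
assert abs(e4 / e8 - 4.0) < 0.1
```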
This observation naturally begs the question of whether and when this trade-off really exists, or if it is simply a result of the looseness in the bounds. In particular, one could ask how tight the bound (3) is.
While the bound on Eρ(S_k) is known to be tight for k sufficiently large [18], the remaining terms (which are dominated by |Eρ(S_{n,k}) − E_n(S_{n,k})|) are derived by controlling the supremum of an empirical process

    sup_{S∈S_k} |E_n(S) − Eρ(S)|,    (4)

and it is unknown whether available bounds for it are tight [30]. Indeed, it is not clear how close the distortion redundancy Eρ(S_{n,k}) − Eρ(S_k) is to its known lower bound of order √(d k^{1−4/d}/n) (in expectation) [4]. More importantly, we are not aware of a lower bound for Eρ(S_{n,k}) itself. Indeed, as pointed out in [4], “The exact dependence of the minimax distortion redundancy on k and d is still a challenging open problem”.
Finally, we note that, whenever a trade-off can be shown to hold, it may be used to justify a heuristic for choosing k empirically as the value that minimizes the reconstruction error on a hold-out set.
In Figure 1 we perform some simple numerical simulations showing that the trade-off indeed occurs in certain regimes. The following example provides a situation where a trade-off can be easily shown to occur.
Example 1. Consider a setup in which n = 2 samples are drawn from a uniform distribution on the unit d = 100-sphere, though the argument holds for other n much smaller than d. Because d ≫ n, with high probability, the samples are nearly orthogonal: ⟨x1, x2⟩_X ≈ 0, while a third sample x drawn uniformly on S¹⁰⁰ will also very likely be nearly orthogonal to both x1, x2 [25]. The k-means solution on this dataset is clearly S_{k=1} = {(x1 + x2)/2} (Fig. 2(a)). Indeed, since S_{k=2} = {x1, x2} (Fig. 2(b)), it is Eρ(S_{k=1}) ≈ 1.5 < 2 ≈ Eρ(S_{k=2}) with very high probability.
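Example 1 is easy to reproduce numerically; the following sketch (our own, with the Monte Carlo sample size chosen arbitrarily) estimates both expected reconstruction errors:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 101  # ambient dimension, so that samples lie on the unit sphere S^100

def sphere(n):
    """Draw n points uniformly on the unit sphere in R^D."""
    v = rng.standard_normal((n, D))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

x1, x2 = sphere(2)                 # the n = 2 "training" samples
x = sphere(20_000)                 # fresh samples used to estimate E_rho

S1 = np.array([(x1 + x2) / 2])     # k = 1: a single mean at the midpoint
S2 = np.stack([x1, x2])            # k = 2: means at the sample locations

def err(S):
    d2 = ((x[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()

e1, e2 = err(S1), err(S2)          # with high probability e1 < e2
assert e1 < e2
```

With near-orthogonal unit vectors, ‖x − (x1+x2)/2‖² ≈ 1 + 1/2 = 1.5, while the distance to the nearer of x1, x2 is ≈ 2, so the single midpoint mean wins.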
In this case, it is better to place a single mean closer to the origin (with Eρ({0}) = 1) than to place two means at the sample locations. This example is sufficiently simple that the exact k-means solution is known, but the effect can be observed in more complex settings.

4 Main Results

Contributions. Our work extends previous results in two different directions:

(a) We provide an analysis of k-means for the case in which the data-generating distribution is supported on a manifold embedded in a Hilbert space. In particular, in this setting: 1) we derive new results on the approximation error, and 2) new sample complexity results (learning rates) arising from the choice of k by optimizing the resulting bound. We analyze the case in which a solution is obtained from an approximation algorithm, such as k-means++ [2], to include this computational error in the bounds.

(b) We generalize the above results from k-means to k-flats, deriving learning rates obtained from new bounds on both the statistical and the approximation errors. To the best of our knowledge, these results provide the first theoretical analysis of k-flats in either sense.

We note that the k-means algorithm has been widely studied in the past, and much of our analysis in this case involves the combination of known facts to obtain novel results. However, in the case of k-flats, there is currently no known analysis, and we provide novel results as well as new performance bounds for each of the components in the bounds.
Throughout this section we make the following technical assumption:
Assumption 1. M is a smooth d-manifold with metric of class C¹, contained in the unit ball in X, and with volume measure denoted by µ_I.
The probability measure ρ is absolutely continuous with respect to µ_I, with density p.

4.1 Learning Rates for k-Means

The first result considers the idealized case where we have access to an exact solution for k-means.
Theorem 1. Under Assumption 1, if S_{n,k} is a solution of k-means then, for 0 < δ < 1, there are constants C and γ dependent only on d, and a sufficiently large n0 such that, by setting

    k_n = n^{d/(2(d+2))} · (C/(24√π))^{d/(d+2)} · { ∫_M dµ_I(x) p(x)^{d/(d+2)} },    (5)

and S_n = S_{n,k_n}, it is

    P[ Eρ(S_n) ≤ γ · n^{−1/(d+2)} · √(ln 1/δ) · { ∫_M dµ_I(x) p(x)^{d/(d+2)} } ] ≥ 1 − δ,    (6)

for all n ≥ n0, where C ∼ d/(2πe) and γ grows sublinearly with d.
Remark 1. Note that the distinction between distributions with density in M, and singular distributions, is important. The bound of Equation (6) holds only when the absolutely continuous part of ρ over M is non-vanishing. The case in which the distribution is singular over M requires a different analysis, and may result in faster convergence rates.
The following result considers the case where the k-means++ algorithm is used to compute the estimator.
Theorem 2. Under Assumption 1, if S_{n,k} is the solution of k-means++, then for 0 < δ < 1, there are constants C and γ that depend only on d, and a sufficiently large n0 such that, by setting

    k_n = n^{d/(2(d+2))} · (C/(24√π))^{d/(d+2)} · { ∫_M dµ_I(x) p(x)^{d/(d+2)} },    (7)

and S_n = S_{n,k_n}, it is

    P[ E_Z Eρ(S_n) ≤ γ · n^{−1/(d+2)} · (ln n + ln‖p‖_{d/(d+2)}) · √(ln 1/δ) · { ∫_M dµ_I(x) p(x)^{d/(d+2)} } ] ≥ 1 − δ,    (8)

for all n ≥ n0, where the expectation is with respect to the random choice Z in the algorithm, ‖p‖_{d/(d+2)} = { ∫_M dµ_I(x) p(x)^{d/(d+2)} }^{(d+2)/d}, C ∼ d/(2πe), and γ grows sublinearly with d.
Remark 2. In the particular case that X = Rᵈ and M is contained in the unit ball, we may further bound the distribution-dependent part of Equations 6 and 8. Using Hölder’s inequality, one obtains

    ∫_M dν(x) p(x)^{d/(d+2)} ≤ [ ∫_M dν(x) p(x) ]^{d/(d+2)} · [ ∫_M dν(x) ]^{2/(d+2)} ≤ Vol(M)^{2/(d+2)} ≤ ω_d^{2/(d+2)},    (9)

where ν is the Lebesgue measure in Rᵈ, and ω_d is the volume of the d-dimensional unit ball.
It is clear from the proof of Theorem 1 that, in this case, we may choose

    k_n = n^{d/(2(d+2))} · (C/(24√π))^{d/(d+2)} · ω_d^{2/d},

independently of the density p, to obtain a bound Eρ(S_n*) = O( n^{−1/(d+2)} · √(ln 1/δ) ) with probability 1 − δ (and similarly for Theorem 2, except for an additional ln n term), where the constant only depends on the dimension.
Remark 3. Note that, according to the above theorems, choosing k requires knowledge of properties of the distribution ρ underlying the data, such as the intrinsic dimension of the support.
In fact, following the ideas in [36], Sections 6.3-5, it is easy to prove that choosing k to minimize the reconstruction error on a hold-out set allows one to achieve the same learning rates (up to a logarithmic factor), adaptively, in the sense that knowledge of properties of ρ is not needed.

4.2 Learning Rates for k-Flats

To study k-flats, we need to slightly strengthen Assumption 1 by adding the following:
Assumption 2. Assume the manifold M to have metric of class C³, and finite second fundamental form II [16].

One reason for the higher-smoothness assumption is that k-flats uses higher-order approximation, whose analysis requires a higher order of differentiability.
We begin by providing a result for k-flats on hypersurfaces (codimension one), and next extend it to manifolds in more general spaces.
Theorem 3. Let X = R^{d+1}. Under Assumptions 1 and 2, if F_{n,k} is a solution of k-flats, then there is a constant C that depends only on d, and a sufficiently large n0 such that, by setting

    k_n = n^{d/(2(d+4))} · (C/(2√(2πd)))^{d/(d+4)} · (κ_M)^{4/(d+4)},    (10)

and F_n = F_{n,k_n}, then for all n ≥ n0 it is

    P[ Eρ(F_n) ≤ 2 (8πd)^{2/(d+4)} C^{d/(d+4)} · n^{−2/(d+4)} · √((1/2) ln 1/δ) · (κ_M)^{4/(d+4)} ] ≥ 1 − δ,    (11)

where κ_M := µ_{|II|}(M) = ∫_M dµ_I(x) |κ_G(x)|^{1/2} is the total root curvature of M, µ_{|II|} is the measure associated with the (positive) second fundamental form, and κ_G is the Gaussian curvature on M.
In the more general case of a d-manifold M (with metric in C³) embedded in a separable Hilbert space X, we cannot make any assumption on the codimension of M (the dimension of the orthogonal complement to the tangent space at each point).
In particular, the second fundamental form II, which is an extrinsic quantity describing how the tangent spaces bend locally, is, at every x ∈ M, a map II_x : T_xM → (T_xM)^⊥ (in this case of class C¹ by Assumption 2) from the tangent space to its orthogonal complement (II(x) := B(x, x) in the notation of [16, p. 128]). Crucially, in this case, we may no longer assume the dimension of the orthogonal complement (T_xM)^⊥ to be finite.
Denote by |II_x| = sup_{r∈T_xM, ‖r‖≤1} ‖II_x(r)‖_X the operator norm of II_x. We have:
Theorem 4. Under Assumptions 1 and 2, if F_{n,k} is a solution to the k-flats problem, then there is a constant C that depends only on d, and a sufficiently large n0 such that, by setting

    k_n = n^{d/(2(d+4))} · (C/(2√(2πd)))^{d/(d+4)} · κ_M^{4/(d+4)},    (12)

and F_n = F_{n,k_n}, then for all n ≥ n0 it is

    P[ Eρ(F_n) ≤ 2 (8πd)^{2/(d+4)} C^{d/(d+4)} · n^{−2/(d+4)} · √((1/2) ln 1/δ) · κ_M^{4/(d+4)} ] ≥ 1 − δ,    (13)

where κ_M := ∫_M dµ_I(x) |II_x|².

Note that the better k-flats bounds stem from the higher approximation power of d-flats over points. Although this greatly complicates the setup and proofs, as well as the analysis of the constants, the resulting bounds are of order O(n^{−2/(d+4)}), compared with the slower order O(n^{−1/(d+2)}) of k-means.

4.3 Discussion

In all the results, the final performance does not depend on the dimensionality of the embedding space (which in fact can be infinite), but only on the intrinsic dimension of the space on which the data-generating distribution is defined.
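The gap between the two rates is immediate from the exponents: 2/(d+4) > 1/(d+2) for every d ≥ 1, since 2(d+2) = 2d+4 > d+4. A one-line numerical check (our own illustration):

```python
# k-flats rate n^(-2/(d+4)) vs. k-means rate n^(-1/(d+2))
n = 10_000
for d in (1, 2, 5, 10, 50):
    # the k-flats bound decays strictly faster for every intrinsic dimension d
    assert n ** (-2.0 / (d + 4)) < n ** (-1.0 / (d + 2))
```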
The key to these results is an approximation construction in\nwhich the Voronoi regions on the manifold (points closest to a given mean or \ufb02at) are guaranteed to\nhave vanishing diameter in the limit of k going to in\ufb01nity. Under our construction, a hypersurface is\napproximated ef\ufb01ciently by tracking the variation of its tangent spaces by using the second funda-\nmental form. Where this form vanishes, the Voronoi regions of an approximation will not be ensured\nto have vanishing diameter with k going to in\ufb01nity, unless certain care is taken in the analysis.\nAn important point of interest is that the approximations are controlled by averaged quantities,\nsuch as the total root curvature (k-\ufb02ats for surfaces of codimension one), total curvature (k-\ufb02ats\nin arbitrary codimensions), and d/(d + 2)-norm of the probability density (k-means), which are\nintegrated over the domain where the distribution is de\ufb01ned. Note that these types of quantities have\nbeen linked to provably tight approximations in certain cases, such as for convex manifolds [19, 12],\nin contrast with worst-case methods that place a constraint on a maximum curvature, or minimum\ninjectivity radius (for instance [1, 32].) Intuitively, it is easy to see that a constraint on an average\nquantity may be arbitrarily less restrictive than one on its maximum. A small dif\ufb01cult region (e.g.\nof very high curvature) may cause the bounds of the latter to substantially degrade, while the results\npresented here would not be adversely affected so long as the region is small.\nAdditionally, care has been taken throughout to analyze the behavior of the constants. In particular,\nthere are no constants in the analysis that grow exponentially with the dimension, and in fact, many\nhave polynomial, or slower growth. 
We believe this to be an important point, since this ensures that the asymptotic bounds do not hide an additional exponential dependence on the dimension.

References

[1] William K. Allard, Guangliang Chen, and Mauro Maggioni. Multiscale geometric methods for data sets II: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 1:1–38, 2011.
[2] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA, 2007. SIAM.
[3] Franz Aurenhammer. Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Comput. Surv., 23:345–405, September 1991.
[4] Peter L. Bartlett, Tamas Linder, and Gabor Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44:1802–1813, 1998.
[5] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
[6] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, 2006.
[7] Shai Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Mach. Learn., 66(2-3):243–257, March 2007.
[8] Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.
[9] P. S. Bradley and O. L. Mangasarian. k-plane clustering. J. of Global Optimization, 16:23–32, January 2000.
[10] Joachim M. Buhmann.
Empirical risk approximation: An induction principle for unsupervised learning. Technical report, University of Bonn, 1998.

[11] Joachim M. Buhmann. Information theoretic model validation for clustering. In International Symposium on Information Theory, Austin, Texas. IEEE, 2010. (In press.)

[12] Kenneth L. Clarkson. Building triangulations using ε-nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC '06, pages 326–335, New York, NY, USA, 2006. ACM.

[13] A. Cuevas and R. Fraiman. Set estimation. In New perspectives in stochastic geometry, pages 374–397. Oxford Univ. Press, Oxford, 2010.

[14] A. Cuevas and A. Rodríguez-Casal. Set estimation: An overview and some recent developments. In Recent advances and trends in nonparametric statistics, pages 251–264. Elsevier B. V., Amsterdam, 2003.

[15] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04, pages 551–556, New York, NY, USA, 2004. ACM.

[16] M. P. do Carmo. Riemannian geometry. Theory and Applications Series. Birkhäuser, 1992.

[17] Allen Gersho and Robert M. Gray. Vector quantization and signal compression. Kluwer Academic Publishers, Norwell, MA, USA, 1991.

[18] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2000.

[19] P. M. Gruber. Asymptotic estimates for best and stepwise approximation of convex bodies I. Forum Mathematicum, 15:281–297, 1993.

[20] Peter M. Gruber. Optimum quantization and its applications. Adv. Math., 186, 2004.

[21] P. M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007.

[22] Matthias Hein and Jean-Yves Audibert.
Intrinsic dimensionality estimation of submanifolds in R^d. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 289–296, 2005.

[23] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[24] Ravikrishna Kolluri, Jonathan Richard Shewchuk, and James F. O'Brien. Spectral surface reconstruction from noisy point clouds. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, SGP '04, pages 11–21, New York, NY, USA, 2004. ACM.

[25] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.

[26] David Levin. Mesh-independent surface interpolation. In Brunnett, Hamann, and Mueller, editors, Geometric Modeling for Scientific Visualization, pages 37–49. Springer-Verlag, 2003.

[27] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.

[28] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

[29] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 689–696, 2009.

[30] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, November 2010.

[31] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11), 2010.

[32] Hariharan Narayanan and Sanjoy Mitter.
Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems 23, pages 1786–1794. MIT Press, 2010.

[33] David Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.

[34] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

[35] Florian Steinke, Matthias Hein, and Bernhard Schölkopf. Nonparametric regression between general Riemannian manifolds. SIAM J. Imaging Sci., 3(3):527–563, 2010.

[36] I. Steinwart and A. Christmann. Support vector machines. Information Science and Statistics. Springer, New York, 2008.

[37] Ulrike von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395–416, 2007.