{"title": "On the Reliability of Clustering Stability in the Large Sample Regime", "book": "Advances in Neural Information Processing Systems", "page_first": 1465, "page_last": 1472, "abstract": "Clustering stability is an increasingly popular family of methods for performing model selection in data clustering. The basic idea is that the chosen model should be stable under perturbation or resampling of the data. Despite being reasonably effective in practice, these methods are not well understood theoretically, and present some difficulties. In particular, when the data is assumed to be sampled from an underlying distribution, the solutions returned by the clustering algorithm will usually become more and more stable as the sample size increases. This raises a potentially serious practical difficulty with these methods, because it means there might be some hard-to-compute sample size, beyond which clustering stability estimators 'break down' and become unreliable in detecting the most stable model. Namely, all models will be relatively stable, with differences in their stability measures depending mostly on random and meaningless sampling artifacts. In this paper, we provide a set of general sufficient conditions, which ensure the reliability of clustering stability estimators in the large sample regime. In contrast to previous work, which concentrated on specific toy distributions or specific idealized clustering frameworks, here we make no such assumptions. We then exemplify how these conditions apply to several important families of clustering algorithms, such as maximum likelihood clustering, certain types of kernel clustering, and centroid-based clustering with any Bregman divergence. In addition, we explicitly derive the non-trivial asymptotic behavior of these estimators, for any framework satisfying our conditions. 
This can help us understand what is considered a 'stable' model by these estimators, at least for large enough samples.", "full_text": "On the Reliability of Clustering Stability in the Large Sample Regime - Supplementary Material

Ohad Shamir† and Naftali Tishby†‡
† School of Computer Science and Engineering
‡ Interdisciplinary Center for Neural Computation
The Hebrew University, Jerusalem 91904, Israel
{ohadsh,tishby}@cs.huji.ac.il

A  Exact Formulation of the Sufficient Conditions

In this section, we give a mathematically rigorous formulation of the sufficient conditions discussed in the main paper. For that we will need some additional notation.

First of all, it will be convenient to define a scaled version of our distance measure d_D(A_k(S_1), A_k(S_2)) between clusterings. Formally, define the random variable

  d^m_D(A_k(S_1), A_k(S_2)) := √m · d_D(A_k(S_1), A_k(S_2)) = √m · Pr_{x∼D}( argmax_i f_{θ̂,i}(x) ≠ argmax_i f_{θ̂′,i}(x) ),

where θ̂, θ̂′ ∈ Θ are the solutions returned by A_k(S_1), A_k(S_2), and S_1, S_2 are random samples, each of size m, drawn i.i.d. from the underlying distribution D. The scaling by the square root of the sample size will allow us to analyze the non-trivial asymptotic behavior of these distance measures, which without scaling simply converge to zero in probability as m → ∞.

For some ε > 0 and a set S ⊆ R^n, let B_ε(S) be the ε-neighborhood of S, namely

  B_ε(S) := { x ∈ X : inf_{y∈S} ‖x − y‖₂ ≤ ε }.

In this paper, when we talk about neighborhoods in general, we will always assume they are uniform (namely, contain an ε-neighborhood for some positive ε).

We will also need to define the following variant of d^m_D(A_k(S_1), A_k(S_2)), where we restrict ourselves to the mass in some subset of R^n. Formally, we define the restricted distance between two clusterings, with respect to a set B ⊆ R^n, as

  d^m_D(A_k(S_1), A_k(S_2), B) := √m · Pr_{x∼D}( argmax_i f_{θ̂,i}(x) ≠ argmax_i f_{θ̂′,i}(x)  ∧  x ∈ B ).    (1)

In particular, d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}(∪_{i,j} F_{θ₀,i,j})) refers to the mass which switches clusters and is also inside an r/√m-neighborhood of the limit cluster boundaries (where the boundaries are defined with respect to f_{θ₀}(·)). Once again, when S_1, S_2 are random samples, we can think of it as a random variable with respect to drawing and clustering S_1, S_2.

Conditions. The following conditions shall be assumed to hold:

1. Consistency Condition: θ̂ converges in probability (over drawing and clustering a sample of size m, m → ∞) to some θ₀ ∈ Θ. Furthermore, the association of clusters to indices {1, ..., k} is constant in some neighborhood of θ₀.

2. Central Limit Condition: √m(θ̂ − θ₀) converges in distribution to a multivariate zero-mean Gaussian random variable Z.

3. Regularity Conditions:

(a) f_θ(x) is Sufficiently Smooth: For any θ in some neighborhood of θ₀, and any x in some neighborhood of the cluster boundaries ∪_{i,j} F_{θ₀,i,j}, f_θ(x) is twice continuously differentiable with respect to θ, with a non-zero first derivative and uniformly bounded second derivative for any x.
Both f_{θ₀}(x) and (∂/∂θ) f_{θ₀}(x) are twice differentiable with respect to any x ∈ X, with a uniformly bounded second derivative.

(b) Limit Cluster Boundaries are Reasonably Nice: For any two clusters i, j, F_{θ₀,i,j} is either empty, or a compact, non-self-intersecting, orientable (n−1)-dimensional hypersurface in R^n with finite positive volume, a boundary (edge), and with a neighborhood contained in X in which the underlying density function p(·) is continuous. Moreover, the gradient ∇(f_{θ₀,i}(·) − f_{θ₀,j}(·)) has positive magnitude everywhere on F_{θ₀,i,j}.

(c) Intersections of Cluster Boundaries are Relatively Negligible: For any two distinct non-empty cluster boundaries F_{θ₀,i,j}, F_{θ₀,i′,j′}, we have that

  (1/ε) ∫_{B_ε(F_{θ₀,i,j} ∪ F_{θ₀,i′,j′}) ∩ B_δ(F_{θ₀,i,j}) ∩ B_δ(F_{θ₀,i′,j′})} 1 dx   and   (1/ε) ∫_{B_ε(∂F_{θ₀,i,j})} 1 dx

converge to 0 as ε, δ → 0 (in any manner), where ∂F_{θ₀,i,j} is the edge of F_{θ₀,i,j}.

(d) Minimal Parametric Stability: It holds for some δ > 0 that

  Pr( d^m_D(A_k(S_1), A_k(S_2)) ≠ d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}(∪_{i,j} F_{θ₀,i,j})) ) = O(r^{−3−δ}) + o(1),

where o(1) → 0 as m → ∞. Namely, the mass of D which switches between clusters is with high probability inside thin strips around the limit cluster boundaries, and this high probability increases at least polynomially as the width of the strips increases (see below for a further discussion of this).

The regularity assumptions are relatively mild, and can usually be inferred from the consistency and central limit conditions, as well as the specific clustering framework under consideration. For example, condition 3c and the assumptions on F_{θ₀,i,j} in condition 3b are fulfilled in a clustering framework where the clusters are separated by hyperplanes. As to condition 3d, suppose our clustering framework is such that the cluster boundaries depend on θ̂ in a smooth manner. Then the asymptotic normality of θ̂, with variance O(1/m), and the compactness of X, will generally imply that the cluster boundaries obtained from clustering a sample are contained with high probability inside strips of width O(1/√m) around the limit cluster boundaries. More specifically, the asymptotic probability of this happening for strips of width r/√m will be exponentially high in r, due to the asymptotic normality of θ̂. As a result, the mass which switches between clusters, when we compare two independent clusterings, will be in those strips with probability exponentially high in r. Therefore, condition 3d will hold by a large margin, since only polynomially high probability is required there.

B  Proofs - General Remarks

The proofs will use the additional notation and the sufficient conditions, as presented in Sec. A. Throughout the proofs, we will sometimes use the stochastic order notation O_p(·) and o_p(·) (cf. [8]), defined as follows. Let {X_m} and {Y_m} be sequences of random vectors, defined on the same probability space.
We write X_m = O_p(Y_m) to mean that for each ε > 0 there exists a real number M such that Pr(‖X_m‖ ≥ M‖Y_m‖) < ε if m is large enough. We write X_m = o_p(Y_m) to mean that Pr(‖X_m‖ ≥ ε‖Y_m‖) → 0 for each ε > 0. Notice that {Y_m} may also be non-random. For example, X_m = o_p(1) means that X_m → 0 in probability. When we write for example X_m = Y_m + o_p(1), we mean that X_m − Y_m = o_p(1).

C  Proof of Proposition 1

By condition 3a, f_θ(x) has a first-order Taylor expansion with respect to any θ̂ close enough to θ₀, with a remainder term uniformly bounded for any x:

  f_θ̂(x) = f_{θ₀}(x) + ((∂/∂θ) f_{θ₀}(x))^⊤ (θ̂ − θ₀) + o(‖θ̂ − θ₀‖).    (2)

By the asymptotic normality assumption, √m ‖θ̂ − θ₀‖ = O_p(1), hence ‖θ̂ − θ₀‖ = O_p(1/√m). Therefore, we get from Eq. (2) that

  √m ( f_θ̂(x) − f_{θ₀}(x) ) = ((∂/∂θ) f_{θ₀}(x))^⊤ (√m(θ̂ − θ₀)) + o_p(1),    (3)

where the remainder term o_p(1) does not depend on x. By regularity condition 3a and compactness of X, (∂/∂θ) f_{θ₀}(·) is a uniformly bounded vector-valued function from X to the Euclidean space in which Θ resides. As a result, the mapping θ̂ ↦ ((∂/∂θ) f_{θ₀}(·))^⊤ θ̂ is a mapping from Θ, with the metric induced by the Euclidean space in which it resides, to the space of all uniformly bounded R^k-valued functions on X. We can turn the latter space into a metric space by equipping it with the obvious extension of the supremum norm (namely, for any two functions f(·), g(·), ‖f − g‖ := sup_{x∈X} ‖f(x) − g(x)‖_∞, where ‖·‖_∞ is the infinity norm in Euclidean space).
With this norm, the mapping above is a continuous mapping between two metric spaces. We also know that √m(θ̂ − θ₀) converges in distribution to a multivariate Gaussian random variable Z. By the continuous mapping theorem [8] and Eq. (3), this implies that √m(f_θ̂(·) − f_{θ₀}(·)) converges in distribution to a Gaussian process G(·), where

  G(·) := ((∂/∂θ) f_{θ₀}(·))^⊤ Z.    (4)

D  Proof of Thm. 1

D.1  A High-Level Description of the Proof

The full proof of Thm. 1 is rather long and technical, mostly due to the many subtleties that need to be taken care of. Since these might obscure the main ideas, we present here separately a general overview of the proof, without the finer details.

The purpose of the stability estimator η̂^k_{m,q}, scaled by √m, boils down to trying to assess the "expected" value of the random variable d^m_D(A_k(S_1), A_k(S_2)): we estimate q instantiations of d^m_D(A_k(S_1), A_k(S_2)), and take their average. Our goal is to show that this average, taking m → ∞, is likely to be close to the value instab(A_k, D) as defined in the theorem. The most straightforward way to go about it is to prove that instab(A_k, D) actually equals lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2)), and then use some large deviation bound to prove that √m η̂^k_{m,q} is indeed close to it with high probability, if q is large enough. Unfortunately, computing lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2)) is problematic.
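To preview the difficulty, here is a standard toy example (our own illustration, not part of the paper): a sequence of random variables can converge in distribution while its expectations do not converge to the expectation of the limit. Below, X_m equals m with probability 1/m and 0 otherwise.

```python
import random

# X_m = m with probability 1/m, and 0 otherwise. X_m converges in
# distribution to the constant 0, yet E[X_m] = 1 for every m, so the
# limit of the expectations (1) differs from the expectation of the
# distributional limit (0).

def sample_xm(m, rng, n_draws=200000):
    return [float(m) if rng.random() < 1.0 / m else 0.0 for _ in range(n_draws)]

rng = random.Random(0)
for m in (10, 100, 1000):
    draws = sample_xm(m, rng)
    frac_zero = sum(1 for d in draws if d == 0.0) / len(draws)
    mean = sum(draws) / len(draws)
    print(m, frac_zero, mean)
```

As m grows, the fraction of zero draws approaches 1 (convergence in distribution to 0), while the empirical mean stays near 1. The same gap is exactly what blocks a direct computation of lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2)) from a distributional limit alone.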
The reason is that the convergence tools at our disposal deal with convergence in distribution of random variables, but convergence in distribution does not necessarily imply convergence of expectations. In other words, we can try and analyze the asymptotic distribution of d^m_D(A_k(S_1), A_k(S_2)), but the expected value of this asymptotic distribution is not necessarily the same as lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2)). As a result, we will have to take a more indirect route.

Here is the basic idea: instead of analyzing the asymptotic expectation of d^m_D(A_k(S_1), A_k(S_2)), we analyze the asymptotic expectation of a different random variable, d^m_D(A_k(S_1), A_k(S_2), B), which was formally defined in Eq. (1). Informally, recall that d^m_D(A_k(S_1), A_k(S_2)) is the mass of the underlying distribution D which switches between clusters, when we draw and cluster two independent samples of size m. Then d^m_D(A_k(S_1), A_k(S_2), B) measures the subset of this mass which lies inside some B ⊆ R^n. In particular, following the notation of Sec. A, we will pick B to be B_{r/√m}(∪_{i,j} F_{θ₀,i,j}) for some r > 0. In words, this constitutes strips of width r/√m around the limit cluster boundaries. Writing the above expression for B as B_{r/√m}, we have that if r is large enough, then d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) is equal to d^m_D(A_k(S_1), A_k(S_2)) with very high probability over drawing and clustering a pair of samples, for any large enough sample size m. Basically, this is because the fluctuations of the cluster boundaries, based on drawing and clustering a random sample of size m, cannot be too large, and therefore the mass which switches clusters is concentrated around the limit cluster boundaries, if m is large enough.

The advantage of the 'surrogate' random variable d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) is that it is bounded for any finite r, unlike d^m_D(A_k(S_1), A_k(S_2)).
With bounded random variables, convergence in distribution does imply convergence of expectations, and as a result we are able to calculate lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) explicitly. This will turn out to be very close to instab(A_k, D) as it appears in the theorem (in fact, we can make it arbitrarily close to instab(A_k, D) by making r large enough). Using the fact that d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) and d^m_D(A_k(S_1), A_k(S_2)) are equal with very high probability, we show that conditioned on a highly probable event, √m η̂^k_{m,q} is an unbiased estimator of E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}), based on q instantiations, for any sample size m. As a result, using large deviation bounds, we get that √m η̂^k_{m,q} is close to E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}), with a high probability which does not depend on m. Therefore, as m → ∞, √m η̂^k_{m,q} will be close to lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) with high probability. By picking r to scale appropriately with q, our theorem follows.

For convenience, the proof is divided into two parts: in Subsec. D.2, we calculate lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) explicitly, while Subsec. D.3 executes the general plan outlined above to prove our theorem.

A few more words are in order about the calculation of lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) in Subsec. D.2, since it is rather long and involved in itself. Our goal is to perform this calculation without going through an intermediate step of explicitly characterizing the distribution of d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}).
This is because the distribution might be highly dependent on the specific clustering framework, and thus it is unsuitable for the level of generality which we aim at (in other words, we do not wish to assume a specific clustering framework). The idea is as follows: recall that d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) is the mass of the underlying distribution D, inside strips of width r/√m around the limit cluster boundaries, which switches clusters when we draw and cluster two independent samples of size m. For any x ∈ X, let A_x be the event that x switched clusters. Then we can write E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}), by Fubini's theorem, as

  E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}) = √m E ∫_{B_{r/√m}} 1(A_x) p(x) dx = ∫_{B_{r/√m}} √m Pr(A_x) p(x) dx.    (5)

The heart of the proof is Lemma D.5, which considers what happens to the integral above inside a single strip near one of the limit cluster boundaries F_{θ₀,i,j}. The main body of the proof then shows how the results of Lemma D.5 can be combined to give the asymptotic value of Eq. (5) when we take the integral over all of B_{r/√m}. The bottom line is that we can simply sum the contributions from each strip, because the intersection of these different strips is asymptotically negligible. All the other lemmas in Subsec. D.2 develop technical results needed for our proof.

Finally, let us describe the proof of Lemma D.5 in a bit more detail. It starts with an expression equivalent to the one in Eq. (5), and transforms it into an expression composed of a constant value, plus a remainder term which converges to 0 as m → ∞. The development can be divided into a number of steps. The first step is rewriting everything using the asymptotic Gaussian distribution of the cluster association function f_θ̂(x) for each x, plus remainder terms (Eq. (13)).
Since we are integrating over x, special care is given to show that the convergence to the asymptotic distribution is uniform for all x in the domain of integration. The second step is to rewrite the integral (which is over a strip around the cluster boundary) as a double integral along the cluster boundary itself, and along a normal segment at any point on the cluster boundary (Eq. (14)). Since the strips become arbitrarily small as m → ∞, the third step consists of rewriting everything in terms of a Taylor expansion around each point on the cluster boundary (Eq. (16), Eq. (17) and Eq. (18)). The fourth and final step is a change of variables, and after a few more manipulations we get the required result.

D.2  Part 1: Auxiliary Result

As described in the previous subsection, we will need an auxiliary result (Proposition D.1 below), characterizing the asymptotic expected value of d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}(∪_{i,j} F_{θ₀,i,j})).

Proposition D.1. Let r > 0. Assuming the set of conditions from Sec. A holds, lim_{m→∞} E d^m_D(A_k(S_1), A_k(S_2), B_{r/√m}(∪_{i,j} F_{θ₀,i,j})) is equal to

  2 (1/√π − h(r)) Σ_{1≤i<j≤k} ∫_{F_{θ₀,i,j}} ( p(x) √Var(G_i(x) − G_j(x)) / ‖∇(f_{θ₀,i}(x) − f_{θ₀,j}(x))‖ ) dx,

for some function h(r) = O(exp(−r²)) (as in Lemma D.5 below).

Lemma D.1. Let S be a compact, non-self-intersecting, orientable (n−1)-dimensional hypersurface in R^n as in regularity condition 3b, and let g(·) be a function continuous on a neighborhood of S. Then for any small enough ε > 0,

  (1/ε) ∫_{B_ε(S)} g(x) dx = (1/ε) ∫_S ∫_{−ε}^{ε} g(x + y·n_x) dy dx + o(1),    (6)

where n_x is a unit normal vector to S at x, and o(1) → 0 as ε → 0.

Proof. Let B′_ε(S) be a strip around S, composed of all points which are on some normal to S and close enough to S:

  B′_ε(S) := { z ∈ R^n : ∃x ∈ S, ∃y ∈ [−ε, ε], z = x + y·n_x }.

Since S is orientable, for small enough ε > 0, B′_ε(S) is diffeomorphic to S × [−ε, ε]. In particular, the map φ : S × [−ε, ε] → B′_ε(S), defined by φ(x, y) = x + y·n_x, will be a diffeomorphism.
Let Dφ(x, y) be the Jacobian of φ at the point (x, y) ∈ S × [−ε, ε]. Note that Dφ(x, 0) = 1 for every x ∈ S.

We now wish to claim that as ε → 0,

  (1/ε) ∫_{B_ε(S)} g(x) dx = (1/ε) ∫_{B′_ε(S)} g(x) dx + o(1).    (7)

To see this, we begin by noting that B′_ε(S) ⊆ B_ε(S). Moreover, any point in B_ε(S) \ B′_ε(S) has the property that its projection to the closest point in S is not along a normal to S, and thus must be ε-close to the edge of S. As a result of regularity condition 3c for S, and the fact that g(·) is continuous and hence uniformly bounded in the volume of integration, we get that the integration of g(·) over B_ε \ B′_ε is asymptotically negligible (as ε → 0), and hence Eq. (7) is justified.

By the change of variables theorem from multivariate calculus, followed by Fubini's theorem, and using the fact that Dφ is continuous and equals 1 on S × {0},

  (1/ε) ∫_{B′_ε(S)} g(x) dx = (1/ε) ∫_{S×[−ε,ε]} g(x + y·n_x) Dφ(x, y) dx dy
                            = (1/ε) ∫_{−ε}^{ε} ( ∫_S g(x + y·n_x) Dφ(x, y) dx ) dy
                            = (1/ε) ∫_{−ε}^{ε} ( ∫_S g(x + y·n_x) dx ) dy + o(1),

where o(1) → 0 as ε → 0. Combining this with Eq. (7) yields the required result.

Lemma D.2. Let (g_m : X → R)_{m=1}^∞ be a sequence of integrable functions, such that g_m(x) → 0 uniformly for all x as m → ∞. Then for any i, j ∈ {1, ..., k}, i ≠ j,

  ∫_{B_{r/√m}(F_{θ₀,i,j})} √m g_m(x) p(x) dx → 0   as m → ∞.

Proof.
By the assumptions on (g_m(·))_{m=1}^∞, there exists a sequence of positive constants (b_m)_{m=1}^∞, converging to 0, such that

  | ∫_{B_{r/√m}(F_{θ₀,i,j})} √m g_m(x) p(x) dx | ≤ b_m ∫_{B_{r/√m}(F_{θ₀,i,j})} √m p(x) dx.

For large enough m, p(x) is bounded and continuous in the volume of integration. Applying Lemma D.1 with ε = r/√m, we have that as m → ∞,

  b_m √m ∫_{B_{r/√m}(F_{θ₀,i,j})} p(x) dx = b_m √m ∫_{F_{θ₀,i,j}} ∫_{−r/√m}^{r/√m} p(x + y·n_x) dy dx + o(1)
                                         ≤ b_m √m · (C/√m) + o(1) = b_m C + o(1)

for some constant C dependent on r and the upper bound on p(·). Since b_m converges to 0, we have that the expression in the lemma converges to 0 as well.

Lemma D.3. Let (X_m) and (Y_m) be sequences of real random variables, such that X_m, Y_m are defined on the same probability space, and X_m − Y_m converges to 0 in probability. Assume that Y_m converges in distribution to a continuous random variable Y. Then |Pr(X_m ≤ c) − Pr(Y_m ≤ c)| converges to 0 uniformly for all c ∈ R.

Proof. We will use the following standard fact (see for example Section 7.2 of [4]): for any two real random variables A, B, any c ∈ R and any ε > 0, it holds that

  Pr(A ≤ c) ≤ Pr(B ≤ c + ε) + Pr(|A − B| > ε).

From this inequality, it follows that for any c ∈ R and any ε > 0,

  |Pr(X_m ≤ c) − Pr(Y_m ≤ c)| ≤ ( Pr(Y_m ≤ c + ε) − Pr(Y_m ≤ c) ) + ( Pr(Y_m ≤ c) − Pr(Y_m ≤ c − ε) ) + Pr(|X_m − Y_m| ≥ ε).    (8)

We claim that the r.h.s of Eq.
(8) converges to 0 uniformly for all c, from which the lemma follows. To see this, we begin by noticing that Pr(|X_m − Y_m| ≥ ε) converges to 0 for any ε by definition of convergence in probability. Next, Pr(Y_m ≤ c′) converges to Pr(Y ≤ c′) uniformly for all c′ ∈ R, since Y is continuous (see Section 1 of [6]). Moreover, since Y is a continuous random variable, we have that its distribution function is uniformly continuous, hence Pr(Y ≤ c + ε) − Pr(Y ≤ c) and Pr(Y ≤ c) − Pr(Y ≤ c − ε) converge to 0 as ε → 0, uniformly for all c. Therefore, by letting m → ∞, and ε → 0 at an appropriate rate compared to m, we have that the l.h.s of Eq. (8) converges to 0 uniformly for all c.

Lemma D.4. Pr( ⟨a, √m(f_θ̂(x) − f_{θ₀}(x))⟩ < b ) converges to Pr( ⟨a, G(x)⟩ < b ) uniformly for any x ∈ X, any a ≠ 0 in some bounded subset of R^k, and any b ∈ R.

Proof. By Eq. (3),

  √m ( f_θ̂(x) − f_{θ₀}(x) ) = ((∂/∂θ) f_{θ₀}(x))^⊤ (√m(θ̂ − θ₀)) + o_p(1),

where the remainder term does not depend on x. Thus, for any a in a bounded subset of R^k,

  ⟨ a, √m( f_θ̂(x) − f_{θ₀}(x) ) ⟩ = ⟨ a ((∂/∂θ) f_{θ₀}(x))^⊤, √m(θ̂ − θ₀) ⟩ + o_p(1),    (9)

where the convergence in probability is uniform for all bounded a and x ∈ X.

We now need a result which tells us when convergence in distribution is uniform. Using Thm. 4.2 in [6], we have that if a sequence of random vectors (X_m)_{m=1}^∞ in Euclidean space converges to a random variable X in distribution, then Pr(⟨y, X_m⟩ < b) converges to Pr(⟨y, X⟩ < b) uniformly for any vector y and b ∈ R. We note that a stronger result (Thm.
6 in [2]) apparently allows us to extend this to cases where X_m and X reside in some infinite-dimensional, separable Hilbert space (for example, if Θ is a subset of an infinite-dimensional reproducing kernel Hilbert space in kernel clustering). Therefore, recalling that √m(θ̂ − θ₀) converges in distribution to a random normal vector Z, we have that uniformly for all x, a, b,

  Pr( ⟨ a ((∂/∂θ) f_{θ₀}(x))^⊤, √m(θ̂ − θ₀) ⟩ < b ) = Pr( ⟨ a ((∂/∂θ) f_{θ₀}(x))^⊤, Z ⟩ < b ) + o(1)
                                                   = Pr( ⟨a, G(x)⟩ < b ) + o(1).    (10)

Here we think of a ((∂/∂θ) f_{θ₀}(x))^⊤ as the vector y to which we apply the theorem. By regularity condition 3a, and assuming a ≠ 0, we have that ⟨ a ((∂/∂θ) f_{θ₀}(x))^⊤, Z ⟩ is a continuous real random variable for any x, unless Z = 0, in which case the lemma is trivial. Therefore, the conditions of Lemma D.3 apply: the two sides of Eq. (9) give us two sequences of random variables which converge in probability to each other, and by Eq. (10) we have convergence in distribution of one of the sequences to a fixed continuous random variable. Therefore, using Lemma D.3, we have that

  Pr( ⟨a, √m( f_θ̂(x) − f_{θ₀}(x) )⟩ < b ) = Pr( ⟨ a ((∂/∂θ) f_{θ₀}(x))^⊤, √m(θ̂ − θ₀) ⟩ < b ) + o(1),    (11)

where the convergence is uniform for any bounded a ≠ 0, b and x ∈ X. Combining Eq. (10) and Eq. (11) gives us the required result.

Lemma D.5. Fix some two clusters i, j.
Assuming the expression below is integrable, we have that

  2 ∫_{B_{r/√m}(F_{θ₀,i,j})} √m Pr( f_{θ̂,i}(x) − f_{θ̂,j}(x) < 0 ) Pr( f_{θ̂,i}(x) − f_{θ̂,j}(x) > 0 ) p(x) dx
    = 2 (1/√π − h(r)) ∫_{F_{θ₀,i,j}} ( p(x) √Var(G_i(x) − G_j(x)) / ‖∇(f_{θ₀,i}(x) − f_{θ₀,j}(x))‖ ) dx + o(1),

where o(1) → 0 as m → ∞ and h(r) = O(exp(−r²)).

Proof. Define a ∈ R^k by a_i = 1, a_j = −1, and 0 for any other entry. Applying Lemma D.4, with a as above, we have that uniformly for all x in some small enough neighborhood around F_{θ₀,i,j}:

  Pr( f_{θ̂,i}(x) − f_{θ̂,j}(x) < 0 )
    = Pr( √m(f_{θ̂,i}(x) − f_{θ₀,i}(x)) − √m(f_{θ̂,j}(x) − f_{θ₀,j}(x)) < √m(f_{θ₀,j}(x) − f_{θ₀,i}(x)) )
    = Pr( G_i(x) − G_j(x) < √m(f_{θ₀,j}(x) − f_{θ₀,i}(x)) ) + o(1),

where o(1) converges uniformly to 0 as m → ∞.

Since G_i(x) − G_j(x) has a zero-mean normal distribution, we can rewrite the above (if Var(G_i(x) − G_j(x)) > 0) as

  Pr( (G_i(x) − G_j(x)) / √Var(G_i(x) − G_j(x)) < √m(f_{θ₀,j}(x) − f_{θ₀,i}(x)) / √Var(G_i(x) − G_j(x)) ) + o(1)
    = Φ( √m(f_{θ₀,j}(x) − f_{θ₀,i}(x)) / √Var(G_i(x) − G_j(x)) ) + o(1),    (12)

where Φ(·) is the cumulative standard normal distribution function. Notice that by some abuse of notation, the expression is also valid in the case where Var(G_i(x) − G_j(x)) = 0. In that case, G_i(x) − G_j(x) is equal to 0 with probability 1, and thus Pr( G_i(x) − G_j(x) < √m(f_{θ₀,j}(x) − f_{θ₀,i}(x)) ) is 1 if f_{θ₀,j}(x) − f_{θ₀,i}(x) ≥ 0 and 0 if f_{θ₀,j}(x) − f_{θ₀,i}(x) < 0. This is equal to Eq.
(12) if we are willing to assume that Φ(∞) = 1, Φ(0/0) = 1, Φ(−∞) = 0.

Therefore, we can rewrite the l.h.s of the equation in the lemma statement as

  2 ∫_{B_{r/√m}(F_{θ₀,i,j})} [ √m Φ( √m(f_{θ₀,i}(x) − f_{θ₀,j}(x)) / √Var(G_i(x) − G_j(x)) ) ( 1 − Φ( √m(f_{θ₀,i}(x) − f_{θ₀,j}(x)) / √Var(G_i(x) − G_j(x)) ) ) + √m · o(1) ] p(x) dx.

The integration of the remainder term can be rewritten as o(1) by Lemma D.2, and we get that the expression can be rewritten as

  2 ∫_{B_{r/√m}(F_{θ₀,i,j})} √m Φ( √m(f_{θ₀,i}(x) − f_{θ₀,j}(x)) / √Var(G_i(x) − G_j(x)) ) ( 1 − Φ( √m(f_{θ₀,i}(x) − f_{θ₀,j}(x)) / √Var(G_i(x) − G_j(x)) ) ) p(x) dx + o(1).    (13)

One can verify that the expression inside the integral is a continuous function of x, by the regularity conditions and the expression for G(·) as proven in Sec. C (namely Eq. (4)). We can therefore apply Lemma D.1, and again take all the remainder terms outside of the integral by Lemma D.2, to get that the above can be rewritten as

  2 ∫_{F_{θ₀,i,j}} ∫_{−r/√m}^{r/√m} √m Φ( √m(f_{θ₀,i}(x + y n_x) − f_{θ₀,j}(x + y n_x)) / √Var(G_i(x + y n_x) − G_j(x + y n_x)) ) ( 1 − Φ( √m(f_{θ₀,i}(x + y n_x) − f_{θ₀,j}(x + y n_x)) / √Var(G_i(x + y n_x) − G_j(x + y n_x)) ) ) p(x) dy dx + o(1),    (14)

where n_x is a unit normal to F_{θ₀,i,j} at x.

Inspecting Eq. (14), we see that y ranges over an arbitrarily small domain as m → ∞. This suggests that we can rewrite the above using Taylor expansions, which is what we shall do next.

Let us assume for a minute that Var(G_i(x) − G_j(x)) > 0 for some point x ∈ F_{θ₀,i,j}. One can verify that by the regularity conditions and the expression for G(·) in Eq. (4), the expression

  ( f_{θ₀,i}(·) − f_{θ₀,j}(·) ) / √Var(G_i(·) − G_j(·))    (15)

is twice differentiable, with a uniformly bounded second derivative.
Therefore, we can rewrite the expression in Eq. (15) as its first-order Taylor expansion around each x ∈ F_{θ₀,i,j}, plus a remainder term which is uniform for all x:

  ( f_{θ₀,i}(x + y n_x) − f_{θ₀,j}(x + y n_x) ) / √Var(G_i(x + y n_x) − G_j(x + y n_x))
    = ( f_{θ₀,i}(x) − f_{θ₀,j}(x) ) / √Var(G_i(x) − G_j(x)) + ∇( ( f_{θ₀,i}(x) − f_{θ₀,j}(x) ) / √Var(G_i(x) − G_j(x)) ) · y n_x + O(y²).

Since f_{θ₀,i}(x) − f_{θ₀,j}(x) = 0 for any x ∈ F_{θ₀,i,j}, the expression reduces after a simple calculation to

  ( ∇( f_{θ₀,i}(x) − f_{θ₀,j}(x) ) / √Var(G_i(x) − G_j(x)) ) · y n_x + O(y²).

Notice that ∇( f_{θ₀,i}(x) − f_{θ₀,j}(x) ) (the gradient of f_{θ₀,i}(x) − f_{θ₀,j}(x)) has the same direction as n_x (the normal to the cluster boundary). Therefore, the expression above can be rewritten, up to a sign, as

  ‖ ∇( f_{θ₀,i}(x) − f_{θ₀,j}(x) ) / √Var(G_i(x) − G_j(x)) ‖ · y + O(y²).

As a result, denoting s(x) := ∇( f_{θ₀,i}(x) − f_{θ₀,j}(x) ) / √Var(G_i(x) − G_j(x)), we have that

  Φ( √m(f_{θ₀,i}(x + y n_x) − f_{θ₀,j}(x + y n_x)) / √Var(G_i(x + y n_x) − G_j(x + y n_x)) ) ( 1 − Φ( √m(f_{θ₀,i}(x + y n_x) − f_{θ₀,j}(x + y n_x)) / √Var(G_i(x + y n_x) − G_j(x + y n_x)) ) )
    = Φ( √m( ‖s(x)‖ y + O(y²) ) ) ( 1 − Φ( √m( ‖s(x)‖ y + O(y²) ) ) )    (16)
    = Φ( √m ‖s(x)‖ y ) ( 1 − Φ( √m ‖s(x)‖ y ) ) + O(√m y²).    (17)

In the preceding development, we have assumed that Var(G_i(x) − G_j(x)) > 0. However, notice that the expressions in Eq. (16) and Eq.
(17), without the remainder term, are both equal (to zero) even if $\mathrm{Var}(G_i(x) - G_j(x)) = 0$ (with our previous abuse of notation that $\Phi(-\infty) = 0$, $\Phi(\infty) = 1$). Moreover, since $y$ takes values in $[-r/\sqrt{m}, r/\sqrt{m}]$, the remainder term $O(\sqrt{m}\,y^2)$ is at most $O(\sqrt{m}\,r^2/m) = O(r^2/\sqrt{m})$, so it can be rewritten as $o(1)$, which converges to $0$ as $m \to \infty$.

In conclusion, and again using Lemma D.2 to take the remainder terms outside of the integral, we can rewrite Eq. (14) as

$$
2\int_{F_{\theta_0,i,j}}\int_{-r/\sqrt{m}}^{r/\sqrt{m}} \sqrt{m}\,\Phi\!\left(\sqrt{m}\,\|s(x)\| y\right)\left(1 - \Phi\!\left(\sqrt{m}\,\|s(x)\| y\right)\right) p(x)\,dy\,dx + o(1). \tag{18}
$$

We now perform a change of variables, letting $z_x = \sqrt{m}\,\|s(x)\| y$ in the inner integral, and get

$$
2\int_{F_{\theta_0,i,j}}\int_{-r\|s(x)\|}^{r\|s(x)\|} \frac{1}{\|s(x)\|}\,\Phi(z_x)\left(1 - \Phi(z_x)\right) p(x)\,dz_x\,dx + o(1),
$$

which is equal by the mean value theorem to

$$
2\left(\int_{F_{\theta_0,i,j}} \frac{p(x)}{\|s(x)\|}\,dx\right)\left(\int_{-r\|s(x_0)\|}^{r\|s(x_0)\|} \Phi(z_{x_0})\left(1 - \Phi(z_{x_0})\right) dz_{x_0}\right) + o(1) \tag{19}
$$

for some $x_0 \in F_{\theta_0,i,j}$.

By regularity condition 3b, it can be verified that $\|s(x)\|$ is positive or infinite for any $x \in F_{\theta_0,i,j}$. As a result, as $r \to \infty$, we have that

$$
\int_{-r\|s(x_0)\|}^{r\|s(x_0)\|} \Phi(z_{x_0})\left(1 - \Phi(z_{x_0})\right) dz_{x_0} \;\longrightarrow\; \int_{-\infty}^{\infty} \Phi(z_{x_0})\left(1 - \Phi(z_{x_0})\right) dz_{x_0} = \frac{1}{\sqrt{\pi}},
$$

and the convergence to $1/\sqrt{\pi}$ is at a rate of $O(\exp(-r^2))$. Combining this with Eq. (19) gives us the required result.

Proof of Proposition D.1. We can now turn to prove Proposition D.1 itself. For any $x \in \mathcal{X}$, let $A_x$ be the event (over drawing and clustering a sample pair) that $x$ switched clusters.
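Before continuing, the constant $1/\sqrt{\pi}$ from the preceding lemma, and the rapid decay of the truncation error in $r$, can be sanity-checked numerically (a verification aside, not part of the proof; `Phi` and `integral` are helper names introduced here):

```python
import math

def Phi(z):
    # Standard normal CDF, expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def integral(r, n=200_000):
    # Trapezoidal rule for \int_{-r}^{r} Phi(z)(1 - Phi(z)) dz.
    # The integrand is even, so we compute 2 * \int_0^r.
    h = r / n
    total = 0.5 * (Phi(0.0) * (1.0 - Phi(0.0)) + Phi(r) * (1.0 - Phi(r)))
    for i in range(1, n):
        z = i * h
        total += Phi(z) * (1.0 - Phi(z))
    return 2.0 * h * total

print(integral(10.0))             # ≈ 1/sqrt(pi) ≈ 0.5642
print(1.0 / math.sqrt(math.pi))
print(integral(10.0) - integral(3.0))  # tiny: the tail vanishes quickly in r
```

Already at $r = 3$ the truncated integral differs from the limit by less than $10^{-3}$, consistent with the claimed $O(\exp(-r^2))$ rate.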
For any $F_{\theta_0,i,j}$ and sample size $m$, define $F^m_{\theta_0,i,j}$ to be the subset of $F_{\theta_0,i,j}$ which is at a distance of at least $m^{-1/4}$ from any other cluster boundary (with respect to $\theta_0$). Formally,

$$
F^m_{\theta_0,i,j} := \left\{ x \in F_{\theta_0,i,j} \;:\; \forall\, \left(\{i',j'\} \neq \{i,j\},\; F_{\theta_0,i',j'} \neq \emptyset\right),\; \inf_{y \in F_{\theta_0,i',j'}} \|x - y\| \geq m^{-1/4} \right\}.
$$

Letting $S_1, S_2$ be two independent samples of size $m$, we have by Fubini's theorem that

$$
\mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j})) = \sqrt{m}\, \mathbb{E}_{S_1,S_2} \int_{B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j})} \mathbf{1}(A_x)\, p(x)\, dx = \int_{B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j})} \sqrt{m}\, \Pr(A_x)\, p(x)\, dx
$$
$$
= \int_{B_{r/\sqrt{m}}(\cup_{i,j} F^m_{\theta_0,i,j})} \sqrt{m}\, \Pr(A_x)\, p(x)\, dx + \int_{B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j} \setminus F^m_{\theta_0,i,j})} \sqrt{m}\, \Pr(A_x)\, p(x)\, dx.
$$

As to the first integral, notice that each point in $F^m_{\theta_0,i,j}$ is separated from any point in any other $F^m_{\theta_0,i',j'}$ by a distance of at least $2m^{-1/4}$. Therefore, for large enough $m$, the sets $B_{r/\sqrt{m}}(F^m_{\theta_0,i,j})$ are disjoint for each $i,j$, and we can rewrite the above as

$$
\sum_{1 \leq i < j \leq k} \int_{B_{r/\sqrt{m}}(F^m_{\theta_0,i,j})} \sqrt{m}\, \Pr(A_x)\, p(x)\, dx + \int_{B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j} \setminus F^m_{\theta_0,i,j})} \sqrt{m}\, \Pr(A_x)\, p(x)\, dx. \tag{20}
$$

Had there been only the two clusters $i$ and $j$, we would have

$$
\Pr(A_x) = 2 \Pr\left(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0\right) \Pr\left(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0\right).
$$

This is simply by definition of $A_x$: the probability that under one clustering, based on a random sample, $x$ is more associated with cluster $i$, and that under a second clustering, based on another independent random sample, $x$ is more associated with cluster $j$.

In general, we will have more than two clusters. However, notice that any point $x$ in $B_{r/\sqrt{m}}(F^m_{\theta_0,i,j})$ (for some $i,j$) is much closer to $F_{\theta_0,i,j}$ than to any other cluster boundary. This is because its distance to $F_{\theta_0,i,j}$ is on the order of $1/\sqrt{m}$, while its distance to any other boundary is on the order of $m^{-1/4}$.
Therefore, if $x$ does switch clusters, then it is highly likely to switch between cluster $i$ and cluster $j$. Formally, by regularity condition 3d (which ensures that the cluster boundaries experience at most $O(1/\sqrt{m})$ fluctuations), we have that uniformly for any $x$,

$$
\Pr(A_x) = 2 \Pr\left(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0\right) \Pr\left(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0\right) + o(1),
$$

where $o(1)$ converges to $0$ as $m \to \infty$.

Substituting this back into Eq. (20), using Lemma D.2 to take the remainder term outside the integral, and using the regularity condition 3c in the reverse direction to transform integrals over $F^m_{\theta_0,i,j}$ back into $F_{\theta_0,i,j}$ with asymptotically negligible remainder terms, we get that the quantity we are interested in can be written as

$$
\sum_{1 \leq i < j \leq k} \int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} 2\sqrt{m}\, \Pr\left(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0\right) \Pr\left(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0\right) p(x)\, dx + o(1).
$$

Now we can apply Lemma D.5 to each summand, and get the required result.

D.3 Part 2: Proof of Thm. 1

For notational convenience, we will denote

$$
d^m_{\mathcal{D}}(r) := d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j}))
$$

whenever the omitted terms are obvious from context. If $\widehat{\mathrm{instab}}(A_k,\mathcal{D}) = 0$, the proof of the theorem is straightforward. In this special case, by definition of $\widehat{\mathrm{instab}}(A_k,\mathcal{D})$ in Thm. 1 and Proposition D.1, we have that $d^m_{\mathcal{D}}(r)$ converges in probability to $0$ for any $r$. By regularity condition 3d, for any fixed $q$, $\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i))$ converges in probability to $0$ (because $d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) = d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j}))$ with arbitrarily high probability as $r$ increases). Therefore, $\sqrt{m}\,\hat{\eta}^k_{m,q}$, which is a plug-in estimator of the expected value of $\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i))$, converges in probability to $0$ for any fixed $q$ as $m \to \infty$, and the theorem follows for this special case.
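As an aside, the factor $2\Pr(\cdot < 0)\Pr(\cdot > 0)$ used in the proof above — two independent clusterings place $x$ on opposite sides of the boundary with probability $2p(1-p)$ — can be illustrated with a small simulation. This is a toy model, not the actual clustering framework: the sign of $f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x)$ at a fixed point is modelled as the sign of a Gaussian with an arbitrarily chosen mean `mu` and unit variance, independently for each clustering.

```python
import math, random

random.seed(0)

def Phi(z):
    # Standard normal CDF.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu = 0.4                     # hypothetical mean of f_i(x) - f_j(x) under resampling
p = Phi(-mu)                 # Pr(f_i(x) - f_j(x) < 0) in this toy model

trials = 200_000
disagree = 0
for _ in range(trials):
    s1 = random.gauss(mu, 1.0) < 0   # side of the boundary under the first clustering
    s2 = random.gauss(mu, 1.0) < 0   # side under the second, independent clustering
    if s1 != s2:
        disagree += 1

print(disagree / trials)     # ≈ 2·p·(1−p)
print(2 * p * (1 - p))
```

The empirical disagreement frequency matches $2p(1-p)$ up to Monte Carlo noise.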
Therefore, we will assume from now on that $\widehat{\mathrm{instab}}(A_k,\mathcal{D}) > 0$. We need the following variant of Hoeffding's bound, adapted to conditional probabilities.

Lemma D.6. Fix some $r > 0$. Let $X_1, \ldots, X_q$ be real, nonnegative, independent and identically distributed random variables, such that $\Pr(X_1 \in [0,r]) > 0$. For any $X_i$, let $Y_i$ be a random variable on the same probability space, such that $\Pr(Y_i = X_i \mid X_i \in [0,r]) = 1$. Then for any $\nu > 0$,

$$
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q X_i - \mathbb{E}[Y_1 \mid X_1 \in [0,r]]\right| \geq \nu \;\Big|\; \forall i,\, X_i \in [0,r]\right) \leq 2\exp\left(-\frac{2q\nu^2}{r^2}\right).
$$

Proof. Define an auxiliary set of random variables $Z_1, \ldots, Z_q$, such that $\Pr(Z_i \leq a) = \Pr(X_i \leq a \mid X_i \in [0,r])$ for any $i, a$. In words, $X_i$ and $Z_i$ have the same distribution conditioned on the event $X_i \in [0,r]$. Also, we have that $Y_i$ has the same distribution conditioned on $X_i \in [0,r]$. Therefore, $\mathbb{E}[Y_1 \mid X_1 \in [0,r]] = \mathbb{E}[X_1 \mid X_1 \in [0,r]]$, and as a result $\mathbb{E}[Y_1 \mid X_1 \in [0,r]] = \mathbb{E}[Z_1]$. Therefore, the probability in the lemma above can be written as

$$
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q Z_i - \mathbb{E}[Z_i]\right| \geq \nu\right),
$$

where the $Z_i$ are bounded in $[0,r]$ with probability $1$. Applying the regular Hoeffding's bound gives us the required result.

We now turn to the proof of the theorem. Let $A^m_r$ be the event that for all subsample pairs $\{S^1_i, S^2_i\}$, $d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j})) = d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i))$.
Namely, this is the event that for all subsample pairs, the mass which switches clusters when we compare the two resulting clusterings is always in an $r/\sqrt{m}$-neighborhood of the limit cluster boundaries.

Since $p(\cdot)$ is bounded, we have that $d^m_{\mathcal{D}}(r)$ is deterministically bounded by $O(r)$, with implicit constants depending only on $\mathcal{D}$ and $\theta_0$. Using the law of total expectation, this implies that

$$
\left| \mathbb{E}[d^m_{\mathcal{D}}(r)] - \mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r] \right|
= \left| \Pr(A^m_r)\,\mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r] + (1 - \Pr(A^m_r))\,\mathbb{E}[d^m_{\mathcal{D}}(r) \mid \neg A^m_r] - \mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r] \right|
$$
$$
= \left| (1 - \Pr(A^m_r)) \left( \mathbb{E}[d^m_{\mathcal{D}}(r) \mid \neg A^m_r] - \mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r] \right) \right| \leq (1 - \Pr(A^m_r))\, O(r). \tag{21}
$$

For any two events $A, B$, we have by the law of total probability that $\Pr(A) = \Pr(B)\Pr(A \mid B) + \Pr(B^c)\Pr(A \mid B^c)$. From this it follows that $\Pr(A) \leq \Pr(B) + \Pr(A \mid B^c)$. As a result, for any $\epsilon > 0$,
$$
\Pr\left(\left|\sqrt{m}\,\hat{\eta}^k_{m,q} - \widehat{\mathrm{instab}}(A_k,\mathcal{D})\right| > \epsilon\right)
\leq \Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) - \widehat{\mathrm{instab}}(A_k,\mathcal{D})\right| > \frac{\epsilon}{2}\right)
$$
$$
+ \Pr\left(\left|\sqrt{m}\,\hat{\eta}^k_{m,q} - \widehat{\mathrm{instab}}(A_k,\mathcal{D})\right| > \epsilon \;\Big|\; \left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) - \widehat{\mathrm{instab}}(A_k,\mathcal{D})\right| \leq \frac{\epsilon}{2}\right). \tag{22}
$$

We will assume w.l.o.g. that $\epsilon/2 < \widehat{\mathrm{instab}}(A_k,\mathcal{D})$. Otherwise, we can upper bound $\Pr(|\sqrt{m}\,\hat{\eta}^k_{m,q} - \widehat{\mathrm{instab}}(A_k,\mathcal{D})| > \epsilon)$ in the equation above by replacing $\epsilon$ with some smaller quantity $\epsilon'$ for which $\epsilon'/2 < \widehat{\mathrm{instab}}(A_k,\mathcal{D})$.

We start by analyzing the conditional probability, forming the second summand in Eq. (22). Recall that $\hat{\eta}^k_{m,q}$, after clustering the $q$ subsample pairs $\{S^1_i, S^2_i\}_{i=1}^q$, uses an additional i.i.d sample $S^3_i$ of size $m$ to empirically estimate $\sum_i d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i))/(\sqrt{m}\,q) \in [0,1]$. This is achieved by calculating the average percentage of instances in $S^3_i$ which switch between clusterings. Thus, conditioned on the event appearing in the second summand of Eq. (22), $\hat{\eta}^k_{m,q}$ is simply an empirical average of $m$ i.i.d random variables in $[0,1]$, whose expected value, which we write as $v/\sqrt{m}$, is a strictly positive number in the range $(\widehat{\mathrm{instab}}(A_k,\mathcal{D}) \pm \epsilon/2)/\sqrt{m}$. Thus, the second summand of Eq. (22) refers to an event where this empirical average is at a distance of at least $\epsilon/(2\sqrt{m})$ from its expected value. We can therefore apply a large deviation result to bound this probability. Since the expectation itself is a (generally decreasing) function of the sample size $m$, we will need something a bit stronger than the regular Hoeffding's bound. Using a relative entropy version of Hoeffding's bound [5], we have that the second summand in Eq.
(22) is upper bounded by

$$
\exp\left(-m\, D_{kl}\!\left[\frac{v + \epsilon/2}{\sqrt{m}} \,\Big\|\, \frac{v}{\sqrt{m}}\right]\right) + \exp\left(-m\, D_{kl}\!\left[\max\left\{0, \frac{v - \epsilon/2}{\sqrt{m}}\right\} \,\Big\|\, \frac{v}{\sqrt{m}}\right]\right), \tag{23}
$$

where $D_{kl}[p\|q] := p\log(p/q) + (1-p)\log((1-p)/(1-q))$ for any $q \in (0,1)$ and any $p \in [0,1]$. Using the fact that $D_{kl}[p\|q] \geq (p-q)^2/2\max\{p,q\}$, we get that Eq. (23) can be upper bounded by a quantity which converges to $0$ as $m \to \infty$. As a result, the second summand in Eq. (22) converges to $0$ as $m \to \infty$.

As to the first summand in Eq. (22), using the triangle inequality and switching sides allows us to upper bound it by

$$
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) - \mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r]\right| \geq \frac{\epsilon}{2} - \left|\mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r] - \mathbb{E}[d^m_{\mathcal{D}}(r)]\right| - \left|\mathbb{E}\,d^m_{\mathcal{D}}(r) - \widehat{\mathrm{instab}}(A_k,\mathcal{D})\right|\right). \tag{24}
$$

By the definition of $\widehat{\mathrm{instab}}(A_k,\mathcal{D})$ as appearing in Thm. 1, and Proposition D.1,

$$
\lim_{m \to \infty} \mathbb{E}\,d^m_{\mathcal{D}}(r) - \widehat{\mathrm{instab}}(A_k,\mathcal{D}) = O(h(r)) = O(\exp(-r^2)). \tag{25}
$$

Using Eq. (25) and Eq. (21), we can upper bound Eq.
(24) by

$$
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) - \mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r]\right| \geq \frac{\epsilon}{2} - (1 - \Pr(A^m_r))\,O(r) - O(\exp(-r^2)) - o(1)\right), \tag{26}
$$

where $o(1) \to 0$ as $m \to \infty$. Moreover, by using the law of total probability and Lemma D.6, we have that for any $\nu > 0$,

$$
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) - \mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r]\right| > \nu\right)
\leq (1 - \Pr(A^m_r)) \cdot 1 + \Pr(A^m_r)\,\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) - \mathbb{E}[d^m_{\mathcal{D}}(r) \mid A^m_r]\right| > \nu \;\Big|\; A^m_r\right)
$$
$$
\leq (1 - \Pr(A^m_r)) + 2\Pr(A^m_r)\exp\left(-\frac{2q\nu^2}{r^2}\right). \tag{27}
$$

Lemma D.6 can be applied because $d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) = d^m_{\mathcal{D}}(r)$ for any $i$, if $A^m_r$ occurs.

If $m, r$ are such that

$$
\frac{\epsilon}{2} - (1 - \Pr(A^m_r))\,O(r) - O(\exp(-r^2)) - o(1) > 0, \tag{28}
$$

we can substitute this expression instead of $\nu$ in Eq. (27), and get that Eq. (26) is upper bounded by

$$
(1 - \Pr(A^m_r)) + 2\Pr(A^m_r)\exp\left(-\frac{2q\left(\frac{\epsilon}{2} - (1 - \Pr(A^m_r))\,O(r) - O(\exp(-r^2)) - o(1)\right)^2}{r^2}\right). \tag{29}
$$

Let

$$
g_m(r) := \Pr_{S_1,S_2 \sim \mathcal{D}^m}\left(d^m_{\mathcal{D}}(r) \neq d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))\right), \qquad g(r) := \lim_{m \to \infty} g_m(r).
$$

By regularity condition 3d, $g(r) = O(r^{-3-\delta})$ for some $\delta > 0$. Also, we have that $\Pr(A^m_r) = (1 - g_m(r))^q$, and therefore $\lim_{m \to \infty} \Pr(A^m_r) = (1 - g(r))^q$ for any fixed $q$. In consequence, as $m \to \infty$, Eq.
(29) converges to

$$
\left(1 - (1 - g(r))^q\right) + 2(1 - g(r))^q \exp\left(-\frac{2q\left(\frac{\epsilon}{2} - (1 - (1 - g(r))^q)\,O(r) - O(\exp(-r^2))\right)^2}{r^2}\right). \tag{30}
$$

Now we use the fact that $r$ can be chosen arbitrarily. In particular, let $r = q^{1/(2+\delta/2)}$, where $\delta > 0$ is the same quantity appearing in condition 3d. It follows that

$$
1 - (1 - g(r))^q \leq q\,g(r) = O(q/r^{3+\delta}) = O\!\left(q^{1 - \frac{3+\delta}{2+\delta/2}}\right),
$$
$$
(1 - (1 - g(r))^q)\,O(r) = q\,g(r)\,O(r) = O\!\left(q^{1 - \frac{2+\delta}{2+\delta/2}}\right) = O\!\left(q^{-\frac{\delta}{4+\delta}}\right),
$$
$$
q/r^2 = q^{1 - \frac{1}{1+\delta/4}}, \qquad \exp(-r^2) = \exp\!\left(-q^{\frac{1}{1+\delta/4}}\right).
$$

It can be verified that the equations above imply the validity of Eq. (28) for large enough $m$ and $q$ (and hence $r$). Substituting these equations into Eq. (30), we get an upper bound

$$
O\!\left(q^{1 - \frac{3+\delta}{2+\delta/2}}\right) + \exp\left(-2q^{1 - \frac{1}{1+\delta/4}}\left(\frac{\epsilon}{2} - O\!\left(q^{-\frac{\delta}{4+\delta}}\right) - O\!\left(\exp\!\left(-q^{\frac{1}{1+\delta/4}}\right)\right)\right)^2\right).
$$

Since $\delta > 0$, it can be verified that the first summand asymptotically dominates the second summand (as $q \to \infty$), and can be bounded in turn by $o(q^{-1/2})$.

Summarizing, we have that the first summand in Eq. (22) converges to $o(q^{-1/2})$ as $m \to \infty$, and the second summand in Eq. (22) converges to $0$ as $m \to \infty$, for any fixed $\epsilon > 0$, and thus $\Pr(|\sqrt{m}\,\hat{\eta}^k_{m,q} - \widehat{\mathrm{instab}}(A_k,\mathcal{D})| > \epsilon)$ converges to $o(q^{-1/2})$.

E Proof of Thm. 2 and Thm. 3

The tool we shall use for proving Thm. 2 and Thm. 3 is the following general central limit theorem for Z-estimators (Thm. 3.3.1 in [8]). We will first quote the theorem and then explain the terminology used.

Theorem E.1 (Van der Vaart).
Let \u03a8m and \u03a8 be random maps and a \ufb01xed map, respectively, from\na subset \u0398 of some Banach space into another Banach space such that as m \u2192 \u221e,\n\nk\u221am(\u03a8m \u2212 \u03a8)(\u02c6\u03b8) \u2212 \u221am(\u03a8m \u2212 \u03a8)(\u03b80)k\n\n1 + \u221amk\u02c6\u03b8 \u2212 \u03b80k\n\n\u2192 0\n\n(31)\n\n\u03b80\n\nZ.\n\nin probability, and such that the sequence \u221am(\u03a8m \u2212 \u03a8)(\u03b80) converges in distribution to a tight\nrandom element Z. Let \u03b8 7\u2192 \u03a8(\u03b8) be Fr\u00b4echet-differentiable at \u03b80 with an invertible derivative\n\u02d9\u03a8\u03b80, which is assumed to be a continuous linear operator1. If \u03a8(\u03b80) = 0 and \u03a8m(\u02c6\u03b8)/\u221am \u2192 0\nin probability, and \u02c6\u03b8 converges in probability to \u03b80, then \u221am(\u02c6\u03b8 \u2212 \u03b80) converges in distribution to\n\u2212 \u02d9\u03a8\u22121\nA Banach space is any complete normed vector space (possible in\ufb01nite dimensional). A tight ran-\ndom element essentially means that an arbitrarily large portion of its distribution lies in compact\nsets. This condition is trivial when \u0398 is a subset of Euclidean space. Fr\u00b4echet-differentiability of a\nfunction f : U 7\u2192 V at x \u2208 U, where U, V are Banach spaces, means that there exists a bounded\nlinear operator A : U 7\u2192 V such that\n\nThis is equivalent to regular differentiability in \ufb01nite dimensional settings.\n\nkf (x + h) \u2212 f (x) \u2212 A(h)kW\n\n= 0.\n\nlim\nh\u21920\n\nkhkU\n\nIt is important to note that the theorem is stronger than what we actually need, since we only consider\n\ufb01nite dimensional Euclidean spaces, while the theorem deals with possibly in\ufb01nite dimensional\nBanach spaces. 
In principle, it is possible to use this theorem to prove central limit theorems in infinite dimensional settings, for example in kernel clustering where the associated reproducing kernel Hilbert space is infinite dimensional. However, the required conditions become much less trivial, and actually fail to hold in some cases (see below for further details).

We now turn to the proofs themselves. Since the proofs of Thm. 2 and Thm. 3 are almost identical, we will prove them together, marking differences between them as needed. In order to allow uniform notation in both cases, we shall assume that $\phi(\cdot)$ is the identity mapping in Bregman divergence clustering, and the feature map from $\mathcal{X}$ to $\mathcal{H}$ in kernel clustering.

With the assumptions that we made in the theorems, the only thing really left to show before applying Thm. E.1 is that Eq. (31) holds. Notice that it is enough to show that

$$
\frac{\left\|\sqrt{m}(\Psi^i_m - \Psi^i)(\hat{\theta}) - \sqrt{m}(\Psi^i_m - \Psi^i)(\theta_0)\right\|}{1 + \sqrt{m}\,\|\hat{\theta} - \theta_0\|} \to 0
$$

for any $i \in \{1, \ldots, k\}$. We will prove this in a slightly more complicated way than necessary, which also treats the case of kernel clustering where $\mathcal{H}$ is infinite-dimensional. By Lemma 3.3.5 in [8], since $\mathcal{X}$ is bounded, it is sufficient to show that for any $i$, there is some $\delta > 0$ such that

$$
\{\psi^i_{\hat{\theta},h}(\cdot) - \psi^i_{\theta_0,h}(\cdot)\}_{\|\hat{\theta}-\theta_0\| \leq \delta,\; h \in \mathcal{X}}
$$

is a Donsker class, where

$$
\psi^i_{\theta,h}(x) = \begin{cases} \langle \theta_i - \phi(x), \phi(h) \rangle & x \in C_{\theta,i} \\ 0 & \text{otherwise.} \end{cases}
$$

Intuitively, a set of real functions $\{f(\cdot)\}$ from $\mathcal{X}$ (with any probability distribution $\mathcal{D}$) to $\mathbb{R}$ is called Donsker if it satisfies a uniform central limit theorem.
Without getting too much into the details, this means that if we sample i.i.d $m$ elements from $\mathcal{D}$, then the centered sum $(f(x_1) + \ldots + f(x_m) - m\,\mathbb{E}[f])/\sqrt{m}$ converges in distribution (as $m \to \infty$) to a Gaussian random variable, and the convergence is uniform over all $f(\cdot)$ in the set, in an appropriately defined sense.

We use the fact that if $\mathcal{F}$ and $\mathcal{G}$ are Donsker classes, then so are $\mathcal{F} + \mathcal{G}$ and $\mathcal{F} \cdot \mathcal{G}$ (see examples 2.10.7 and 2.10.8 in [8]). This allows us to reduce the problem to showing that the following three function classes, from $\mathcal{X}$ to $\mathbb{R}$, are Donsker:

$$
\{\langle \hat{\theta}_i, \phi(h) \rangle\}_{\|\hat{\theta}-\theta_0\| \leq \delta,\; h \in \mathcal{X}}, \qquad \{\langle \phi(\cdot), \phi(h) \rangle\}_{h \in \mathcal{X}}, \qquad \{\mathbf{1}_{C_{\hat{\theta},i}}(\cdot)\}_{\|\hat{\theta}-\theta_0\| \leq \delta}. \tag{32}
$$

Notice that the first class is a set of bounded constant functions, while the third class is a set of indicator functions for all possible clusters. One can now use several tools to show that each class in Eq. (32) is Donsker. For example, consider a class of real functions on a bounded subset of some Euclidean space. By Thm. 8.2.1 in [3] (and its preceding discussion), the class is Donsker if any function in the class is differentiable to a sufficiently high order. This ensures that the first class in Eq. (32) is Donsker, because it is composed of constant functions. As to the second class in Eq. (32), the same holds in the case of Bregman divergence clustering (where $\phi(\cdot)$ is the identity function), because it is then just a set of linear functions. For finite dimensional kernel clustering, it is enough to show that $\{\langle \cdot, \phi(h) \rangle\}_{h \in \mathcal{X}}$ is Donsker (namely, the same class of functions after performing the transformation from $\mathcal{X}$ to $\phi(\mathcal{X})$).

¹ A linear operator is automatically continuous in finite dimensional spaces, not necessarily in infinite dimensional spaces.
This is again a set of linear functions in $\mathcal{H}_k$, a subset of some finite dimensional Euclidean space, and so it is Donsker. In infinite dimensional kernel clustering, our class of functions can be written as $\{k(\cdot, h)\}_{h \in \mathcal{X}}$, where $k(\cdot,\cdot)$ is the kernel function, so it is Donsker if the kernel function is differentiable to a sufficiently high order.

The third class in Eq. (32) is more problematic. By Theorem 8.2.15 in [3] (and its preceding discussion), it suffices that the boundary of each possible cluster is composed of a finite number of smooth surfaces (differentiable to a high enough order) in some Euclidean space. In Bregman divergence clustering, the clusters are separated by hyperplanes, which are linear functions (see appendix A in [1]), and thus the class is Donsker. The same holds for finite dimensional kernel clustering. This will still be true for infinite dimensional kernel clustering, if we can guarantee that any cluster in any solution close enough to $\theta_0$ in $\Theta$ will have smooth boundaries. Unfortunately, this does not hold in some important cases. For example, universal kernels (such as the Gaussian kernel) are capable of inducing cluster boundaries arbitrarily close in form to any continuous function, and thus our line of attack will not work in such cases. In a sense, this is not too surprising, since these kernels correspond to very 'rich' hypothesis classes, and it is not clear if a precise characterization of their stability properties, via central limit theorems, is at all possible.

Summarizing the above discussion, we have shown that for the settings assumed in our theorem, all three classes in Eq. (32) are Donsker and hence Eq. (31) holds. We now return to deal with the other ingredients required to apply Thm.
E.1.

As to the asymptotic distribution of $\sqrt{m}(\Psi_m - \Psi)(\theta_0)$, since $\Psi(\theta_0) = 0$ by assumption, we have that for any $i \in \{1, \ldots, k\}$,

$$
\sqrt{m}(\Psi^i_m - \Psi^i)(\theta_0) = \frac{1}{\sqrt{m}} \sum_{j=1}^m \Delta^i(\theta_0, x_j), \tag{33}
$$

where $x_1, \ldots, x_m$ is the sample by which $\Psi_m$ is defined. The r.h.s of Eq. (33) is a sum of identically distributed, independent random variables with zero mean, normalized by $\sqrt{m}$. As a result, by the standard central limit theorem, $\sqrt{m}(\Psi^i_m - \Psi^i)(\theta_0)$ converges in distribution to a zero mean Gaussian random vector $Y$, with covariance matrix

$$
V_i = \int_{C_{\theta_0,i}} p(x)\,(\phi(x) - \theta_{0,i})(\phi(x) - \theta_{0,i})^\top\, dx.
$$

Moreover, it is easily verified that $\mathrm{Cov}(\Delta^i(\theta_0, x), \Delta^{i'}(\theta_0, x)) = 0$ for any $i \neq i'$. Therefore, $\sqrt{m}(\Psi_m - \Psi)(\theta_0)$ converges in distribution to a zero mean Gaussian random vector, whose covariance matrix $V$ is composed of $k$ diagonal blocks $(V_1, \ldots, V_k)$, all other elements of $V$ being zero.

Thus, we can use Thm. E.1 to get that $\sqrt{m}(\hat{\theta} - \theta_0)$ converges in distribution to a zero mean Gaussian random vector of the form $-\dot{\Psi}_{\theta_0}^{-1} Y$, which is a Gaussian random vector with a covariance matrix of the form $\dot{\Psi}_{\theta_0}^{-1} V \dot{\Psi}_{\theta_0}^{-1}$.

F Proof of Thm. 4

Since our algorithm returns a locally optimal solution with respect to the differentiable log-likelihood function, we can frame it as a Z-estimator of the derivative of the log-likelihood function with respect to the parameters, namely the score function

$$
\Psi_m(\hat{\theta}) = \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial \theta} \log(q(x_i|\hat{\theta})).
$$

This is a random mapping based on the sample $x_1, \ldots$
, $x_m$.

Similarly, we can define $\Psi(\cdot)$ as the 'asymptotic' score function with respect to the underlying distribution $\mathcal{D}$:

$$
\Psi(\hat{\theta}) = \int_{\mathcal{X}} \frac{\partial}{\partial \theta} \log(q(x|\hat{\theta}))\, p(x)\, dx.
$$

Under the assumptions we have made, the model $\hat{\theta}$ returned by the algorithm satisfies $\Psi_m(\hat{\theta}) = 0$, and $\hat{\theta}$ converges in probability to some $\theta_0$ for which $\Psi(\theta_0) = 0$. The asymptotic normality of $\sqrt{m}(\hat{\theta} - \theta_0)$ is now an immediate consequence of central limit theorems for 'maximum likelihood' Z-estimators, such as Thm. 5.21 in [7].

References

[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

[2] P. Billingsley and F. Topsøe. Uniformity in weak convergence. Probability Theory and Related Fields, 7:1–16, 1967.

[3] R. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1999.

[4] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.

[5] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, Mar. 1963.

[6] R. R. Rao. Relations between weak and uniform convergence of measures with applications. The Annals of Mathematical Statistics, 33(2):659–680, June 1962.

[7] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

[8] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.