{"title": "Relating Leverage Scores and Density using Regularized Christoffel Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1663, "page_last": 1672, "abstract": "Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling, with applications to low-rank approximation, regression, random feature learning and quadrature. Yet the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows us to elucidate their relationships with the chosen kernel as well as the population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.", "full_text": "Relating Leverage Scores and Density using Regularized Christoffel Functions

Edouard Pauwels (IRIT-AOC, Université Toulouse 3 Paul Sabatier, Toulouse, France), Francis Bach (INRIA, Ecole Normale Supérieure, PSL Research University, Paris, France), Jean-Philippe Vert (Google Brain; CBIO Mines ParisTech, PSL Research University, Paris, France)

Abstract

Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling, with applications to low-rank approximation, regression, random feature learning and quadrature. Yet the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows us to elucidate their relationships with the chosen kernel as well as the population density.
Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.

1 Introduction

Statistical leverage scores have historically been used as a diagnostic tool for linear regression [16, 34, 10]. To be concrete, for a ridge regression problem with design matrix $X$ and regularization parameter $\lambda > 0$, the leverage score of each data point is given by the corresponding diagonal element of $X(X^\top X + \lambda I)^{-1} X^\top$. These leverage scores characterize the importance of the corresponding observations and are key to efficient subsampling with optimal approximation guarantees. Leverage scores have therefore emerged as a fundamental tool for matrix sketching and column sampling [22, 21, 13, 36], and play an important role in low-rank matrix approximation [11, 6], regression [2, 28, 20], random feature learning [29] and quadrature [7]. The leverage score is seen as an intrinsic, setting-dependent quantity which must eventually be estimated. In this work we elucidate the relation between leverage scores and the learning setting (population measure and statistical model) when used with kernel methods.

For that purpose, we introduce a variant of the Christoffel function, a classical tool in polynomial algebra which bounds the evaluation at a given point of a polynomial $P$ of given degree in terms of an average value of $P^2$. The Christoffel function is an important object in the theory of orthogonal polynomials [32, 14] and has found applications in approximation theory [26] and in the spectral analysis of random matrices [5].
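Going back to the ridge regression definition above, the leverage scores are just the diagonal of the hat matrix $X(X^\top X + \lambda I)^{-1} X^\top$. A minimal numpy sketch on synthetic data (sizes, seed and data are purely illustrative, not from the paper):

```python
import numpy as np

def ridge_leverage_scores(X, lam):
    """Diagonal of the ridge hat matrix X (X^T X + lam I)^{-1} X^T."""
    n, d = X.shape
    G = X.T @ X + lam * np.eye(d)
    # einsum extracts the diagonal without forming the full n x n hat matrix
    return np.einsum("ij,jk,ik->i", X, np.linalg.inv(G), X)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
scores = ridge_leverage_scores(X, lam=1.0)

# Sanity checks: each score lies in [0, 1), and the scores sum to the
# effective degrees of freedom (the trace of the hat matrix)
assert np.all(scores >= 0) and np.all(scores < 1)
dof = np.trace(X @ np.linalg.inv(X.T @ X + np.eye(5)) @ X.T)
assert abs(scores.sum() - dof) < 1e-8
```

The sum of the scores equals the effective degrees of freedom of the ridge estimator, which is one reason they are used to calibrate subsampling budgets.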
The Christoffel function is parametrized by the degree of the polynomials considered and an associated measure, and it is known that, as the polynomial degree increases, it encodes information about the support and the density of the associated measure; see [23, 24, 33] for the univariate case and [8, 9, 37, 38, 18, 19] for the multivariate case.

The variant we propose amounts to replacing the set of polynomials of fixed degree, used in the definition of the Christoffel function, by a set of functions with bounded norm in a reproducing kernel Hilbert space (RKHS).¹ More precisely, given a density $p$ on $\mathbb{R}^d$ and a regularization parameter $\lambda > 0$, we introduce the regularized Christoffel function $C_\lambda : \mathbb{R}^d \to \mathbb{R}$, where $\lambda$ plays a role similar to the degree for polynomials. The function $C_\lambda$ turns out to have intrinsic connections with statistical leverage scores, as the quantity $1/C_\lambda$ corresponds precisely to the notion of leverage used in [6, 2, 28, 7]. As a consequence, we uncover a variational formulation for leverage scores which helps elucidate their connections with the RKHS and the density $p$ on $\mathbb{R}^d$.

¹ Kernelized Christoffel functions were first proposed by Laurent El Ghaoui and independently studied in [4].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our main contribution is a precise asymptotic expansion of $C_\lambda$ as $\lambda \to 0$, under restrictions on the RKHS. To give a concrete example, if we consider the Sobolev space of functions on $\mathbb{R}^d$ with square-integrable derivatives of order up to $s > d/2$, we obtain the asymptotic equivalent

$C_\lambda(z) \underset{\lambda \to 0,\ \lambda > 0}{\sim} q_0^{-1}\, \lambda^{d/(2s)}\, p(z)^{1 - d/(2s)},$

for $z$ a continuity point of $p$ with $p(z) > 0$. Here $q_0$ is an explicit constant which only depends on the RKHS. We recover scalings with respect to $\lambda$ which match known estimates for the usual degrees of freedom [28, 7].
More importantly, we also obtain a precise spatial description of $C_\lambda(z)$ (i.e., of the dependency on $z$), and deduce that the leverage score is itself proportional to $p(z)^{d/(2s) - 1}$ in the limit. Roughly speaking, large scores are given to low-density regions (note that $d/(2s) - 1 < 0$). This result has several potential consequences for machine learning: (i) The Christoffel function could be used for density or support estimation; this has connections with the spectral approach proposed in [35] for support learning. (ii) This could provide a more efficient way to estimate leverage scores, through density estimation. (iii) When leverage scores are used for sampling, the required sample size depends on the ratio between the maximum and the average leverage scores [28, 7]. Our results imply that this ratio can be large if there exist low-density regions, while it remains bounded otherwise.

Organization of the paper. We introduce the regularized Christoffel function in Section 2 and make explicit its connections with leverage scores and orthogonal polynomials. Our main result and assumptions are described in abstract form in Section 3; they are presented as a general recipe to compute asymptotic expansions of the regularized Christoffel function. Section 3.3 describes an explicit example and a precise asymptotic equivalent for an important class of RKHSs related to Sobolev spaces. We illustrate our results numerically in Section 4. The proofs are postponed to Appendix B, while Appendix A contains additional properties and simulations and Appendix C contains further lemmas.

Notations. Let $d$ denote the ambient dimension, $0$ the origin in $\mathbb{R}^d$, and $C(\mathbb{R}^d)$, $L^1(\mathbb{R}^d)$, $L^2(\mathbb{R}^d)$, $L^\infty(\mathbb{R}^d)$ the complex-valued functions on $\mathbb{R}^d$ which are, respectively, continuous, absolutely integrable, square integrable, and measurable and essentially bounded. For any $f \in L^1(\mathbb{R}^d)$, let $\hat f : \mathbb{R}^d \to \mathbb{C}$ be its Fourier transform, $\hat f : \omega \mapsto \int_{\mathbb{R}^d} f(x)\, e^{-i x^\top \omega}\, dx$.
For $g \in L^1(\mathbb{R}^d)$, its inverse Fourier transform is $x \mapsto \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} g(\omega)\, e^{i x^\top \omega}\, d\omega$. If $f \in L^1(\mathbb{R}^d) \cap C(\mathbb{R}^d)$ and $\hat f \in L^1(\mathbb{R}^d)$, then the inverse transform composed with the direct transform leaves $f$ unchanged. The Fourier transform extends to $L^2(\mathbb{R}^d)$ by a density argument and defines an isometry: for $f \in L^2(\mathbb{R}^d)$, the Parseval formula reads $\int_{\mathbb{R}^d} |f(x)|^2\, dx = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\hat f(\omega)|^2\, d\omega$. See, e.g., [17, Chapter 11].

We identify $x$ with a set of $d$ real variables $x_1, \dots, x_d$. To a multi-index $\beta = (\beta_i)_{i=1,\dots,d} \in \mathbb{N}^d$ we associate the monomial $x^\beta := x_1^{\beta_1} x_2^{\beta_2} \cdots x_d^{\beta_d}$, whose degree is $|\beta| := \sum_{i=1}^d \beta_i$. The linear span of monomials forms the set of $d$-variate polynomials. The degree of a polynomial is the highest of the degrees of its monomials with nonzero coefficients (null for the null polynomial). A polynomial $P$ is said to be homogeneous of degree $2s \in \mathbb{N}$ if $P(\lambda x) = \lambda^{2s} P(x)$ for all $\lambda \in \mathbb{R}$ and $x \in \mathbb{R}^d$; it is then composed only of monomials of degree $2s$. See [14] for further details.

2 Regularized Christoffel function

2.1 Definition

In what follows, $k$ is a positive definite, continuous, bounded, integrable, real-valued kernel on $\mathbb{R}^d \times \mathbb{R}^d$ and $p$ is an integrable real function over $\mathbb{R}^d$. We denote by $H$ the RKHS associated to $k$, which is assumed to be dense in $L^2(p)$, the normed space of functions $f : \mathbb{R}^d \to \mathbb{R}$ such that $\int_{\mathbb{R}^d} f^2(x)\, p(x)\, dx < +\infty$. This will be made more precise in Section 3.1.

Figure 1: The black lines represent a density and the corresponding Christoffel function. The colored lines are solutions of problem (1), the corresponding $z$ being represented by the dots.
Outside the support, the optimum is smooth and high values have small overlap with the support. Inside the support, the optimum is less smooth: it forms a peak, sharper in higher-density regions.

Definition 1 The regularized Christoffel function is given, for any $\lambda > 0$ and $z \in \mathbb{R}^d$, by

$C_{\lambda,p,k}(z) = \inf_{f \in H}\ \int_{\mathbb{R}^d} f(x)^2\, p(x)\, dx + \lambda \|f\|_H^2 \quad \text{such that} \quad f(z) = 1. \qquad (1)$

If there is no confusion about the kernel $k$ and the density $p$, we use the notation $C_\lambda = C_{\lambda,p,k}$. More compactly, setting $H_z = \{f \in H,\ f(z) = 1\}$ for $z \in \mathbb{R}^d$, we have $C_\lambda : z \mapsto \inf_{f \in H_z} \|f\|_{L^2(p)}^2 + \lambda \|f\|_H^2$. The value of (1) is intuitively connected to the density $p$. Indeed, the constraint $f(z) = 1$ forces $f$ to remain far from zero in a neighborhood of $z$. Increasing the $p$-measure of this neighborhood increases the value of the Christoffel function. In low-density regions the constraint has little effect, which allows smoother functions with little overlap with higher-density regions and decreases the overall cost. An illustration is given in Figure 1.

2.2 Relation with orthogonal polynomials

The name Christoffel is borrowed from the orthogonal polynomial literature [32, 14, 26]. In this context, the Christoffel function is defined as follows for any degree $l \in \mathbb{N}$:

$\Lambda_l : z \mapsto \min_{P \in \mathbb{R}_l[X]}\ \int (P(x))^2\, p(x)\, dx \quad \text{such that} \quad P(z) = 1,$

where $\mathbb{R}_l[X]$ denotes the set of $d$-variate polynomials of degree at most $l$. The regularized Christoffel function in (1) is a direct extension, replacing the polynomials of increasing degree by functions in an RKHS with increasing norm. $\Lambda_l$ has connections with quadrature and interpolation [26], potential theory and random matrices [5], and orthogonal polynomials [26, 14].
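For intuition, $\Lambda_l$ has a closed form through the moment matrix of the monomial basis: writing $v(x) = (1, x, \dots, x^l)$ and $M = \int v(x) v(x)^\top p(x)\, dx$, minimizing $c^\top M c$ over coefficient vectors $c$ with $c^\top v(z) = 1$ gives $\Lambda_l(z) = 1 / (v(z)^\top M^{-1} v(z))$. A short numpy sketch in dimension one on a discretized uniform density (the density, grid and degrees are illustrative, not from the paper):

```python
import numpy as np

def christoffel_poly(z, xs, weights, degree):
    """Classical Christoffel function Lambda_degree(z) for the discrete
    measure sum_i weights[i] * delta_{xs[i]}, via the monomial moment matrix:
    Lambda(z) = 1 / (v(z)^T M^{-1} v(z)) with v(x) = (1, x, ..., x^degree)."""
    V = np.vander(xs, degree + 1, increasing=True)   # V[i, k] = xs[i] ** k
    M = V.T @ (weights[:, None] * V)                 # moment matrix
    vz = z ** np.arange(degree + 1)
    return 1.0 / (vz @ np.linalg.solve(M, vz))

# Uniform density 1/2 on [-1, 1], approximated by a Riemann sum
xs = np.linspace(-1.0, 1.0, 2001)
w = np.full(xs.shape, 1.0 / len(xs))
vals = [christoffel_poly(0.0, xs, w, l) for l in (2, 4, 6, 8)]

# The feasible set grows with the degree, so Lambda_l(0) decreases with l
assert all(a > b for a, b in zip(vals, vals[1:]))
# For this measure, Lambda_2(0) = 4/9 up to discretization error
assert abs(vals[0] - 4.0 / 9.0) < 2e-3
```

The monomial basis is used only for readability; it becomes numerically ill-conditioned beyond moderate degrees, where an orthogonal polynomial basis would be preferred.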
Relating the asymptotics for large $l$ to properties of $p$ has been a long-standing subject of research [23, 24, 33, 8, 9, 37, 38, 18, 19]. The idea of studying the relation between $C_\lambda$ and $p$ was directly inspired by these works.

2.3 Relation with leverage scores for kernel methods

The (non-centered) covariance of $p$ on $H$ is the bilinear form $\mathrm{Cov} : H \times H \to \mathbb{R}$ given by

$\forall (f, g) \in H^2, \quad \mathrm{Cov}(f, g) = \int_{\mathbb{R}^d} f(x)\, g(x)\, p(x)\, dx.$

The covariance operator $\Sigma : H \to H$ is then defined such that, for all $f, g \in H$, $\mathrm{Cov}(f, g) = \langle \Sigma f, g \rangle_H$. If $\Sigma$ is bounded with respect to $\|\cdot\|_H$, then Lemma 5 in Appendix C shows that

$\forall z \in \mathbb{R}^d, \quad C_\lambda(z) = \left( \left\langle k(z, \cdot),\ (\Sigma + \lambda I)^{-1} k(z, \cdot) \right\rangle_H \right)^{-1},$

which provides a direct link with leverage scores [7], as $C_\lambda(z)$ is exactly the inverse of the population leverage score at $z$. As $\lambda \to 0$, we typically have $\langle k(z, \cdot), (\Sigma + \lambda I)^{-1} k(z, \cdot) \rangle_H \to +\infty$. It is worth emphasizing that spectral estimators (with other functions of the covariance operator than $(\Sigma + \lambda I)^{-1}$) have been proposed for support inference in [35]. An example of such an estimator has the form $F_\lambda : z \mapsto \langle k(z, \cdot),\ \Sigma (\Sigma + \lambda I)^{-1} k(z, \cdot) \rangle_H$, whose finite level sets encode information about the support of $p$ as $\lambda \to 0$ [35]. Our main result should extend to broader classes of spectral functions.

2.4 Estimation from a discrete measure

Practical computation of the regularized Christoffel function requires access to the covariance operator $\Sigma$, which is not available in closed form in general.
A plug-in solution consists in replacing integration with weight $p$ by a discrete approximation of the form $d\rho_n = \sum_{i=1}^n \eta_i\, \delta_{x_i}$, where for each $i = 1, \dots, n$, $\eta_i \in \mathbb{R}_+$ is a weight, $x_i \in \mathbb{R}^d$, and $\delta_{x_i}$ denotes the Dirac measure at $x_i$. We may assume without loss of generality that the points are distinct. Given a kernel function $k$ on $\mathbb{R}^d \times \mathbb{R}^d$, let $K = (k(x_i, x_j))_{i,j=1,\dots,n} \in \mathbb{R}^{n \times n}$ be the Gram matrix and $K_i$ the $i$-th column of $K$ for $i = 1, \dots, n$. We have a closed-form expression for the Christoffel function with plug-in measure $d\rho_n$: for any $\lambda > 0$ and $i = 1, \dots, n$,

$C_{\lambda, \rho_n, k}(x_i) = \inf_{f(x_i) = 1}\ \sum_{j=1}^n \eta_j\, (f(x_j))^2 + \lambda \|f\|_H^2 = \left( K_i^\top \left( K \operatorname{diag}(\eta) K + \lambda K \right)^{-1} K_i \right)^{-1}. \qquad (2)$

This is a consequence of the representer theorem [30]; Lemma 5 allows us to deal with the constraint explicitly. Note that if $\eta_i > 0$ for all $i = 1, \dots, n$, then the Christoffel function may be obtained as a weighted diagonal element of a smoothing matrix: for all $i$, thanks to the matrix inversion lemma, $K_i^\top (K \operatorname{diag}(\eta) K + \lambda K)^{-1} K_i = \eta_i^{-1} \left( K (K + \lambda \operatorname{Diag}(\eta)^{-1})^{-1} \right)_{ii}$. This draws an important connection with statistical leverage scores [21, 13], as it corresponds to the notion introduced for kernel ridge regression [6, 2, 28]. It remains to choose $\eta$ so that $d\rho_n$ approximates integration with weight $p$.

Monte Carlo approximation: Assuming that $\int_{\mathbb{R}^d} p(x)\, dx = 1$, if one has the possibility to draw an i.i.d. sample $(x_i)_{i=1,\dots,n}$ with density $p$, then one can use $\eta_i = \frac{1}{n}$ for $i = 1, \dots, n$. The quality of this approximation is of order $\lambda^{-2} n^{-1/2}$ (see Appendix A).
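The closed form (2) and the matrix-inversion-lemma identity above are easy to check numerically. A sketch with a Gaussian kernel and generic positive weights on synthetic data (kernel choice, sample and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 1e-3
xs = rng.standard_normal(n)                  # sample points (illustrative)
eta = rng.uniform(0.5, 1.5, n)
eta /= eta.sum()                             # positive weights summing to one

# Gaussian kernel Gram matrix
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)

# Eq. (2): C(x_i) = (K_i^T (K diag(eta) K + lam K)^{-1} K_i)^{-1}.
# For invertible K, factoring one K out on the left shows that these
# quadratic forms are the diagonal entries of K (diag(eta) K + lam I)^{-1}.
lev = np.diag(K @ np.linalg.inv(eta[:, None] * K + lam * np.eye(n)))
C = 1.0 / lev                                # plug-in Christoffel values

# Matrix inversion lemma identity from the text:
# K_i^T (...)^{-1} K_i = eta_i^{-1} * (K (K + lam * Diag(eta)^{-1})^{-1})_{ii}
S = K @ np.linalg.inv(K + lam * np.diag(1.0 / eta))
assert np.allclose(lev, np.diag(S) / eta)
```

The identity lets one work with the comparatively well-conditioned matrix $K + \lambda \operatorname{Diag}(\eta)^{-1}$ instead of inverting $K \operatorname{diag}(\eta) K + \lambda K$ directly.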
If $\lambda^2 n^{1/2}$ is large enough, we obtain a good estimate of the Christoffel function (note that better bounds with respect to $\lambda$ could be obtained using tools from [6, 2, 28]).

Riemann sums: If the density $p$ is piecewise smooth, one can approximate integrals with weight $p$ using a uniform grid and a Riemann sum with weights proportional to $p$. The bound in Eq. (8) also holds; the quality of this approximation is typically of order $n^{-1/d}$, which is attractive in dimension 1 but quickly degrades in higher dimensions. Depending on properties of the integrand, quasi-Monte Carlo methods could yield faster quadrature rules [12]; deriving more quantitative deviation bounds and faster rates is left for future research.

3 Relating regularized Christoffel functions to density

We first make our notations and assumptions precise in Section 3.1 and describe our main result in Section 3.2, using Assumption 2, which is given in abstract form. We then describe how this assumption is satisfied by a broad class of kernels in Section 3.3.

3.1 Assumptions

Assumption 1
1. The kernel $k$ is translation invariant: for any $x, y \in \mathbb{R}^d$, $k(x, y) = q(x - y)$, where $q \in L^1(\mathbb{R}^d)$ is the inverse Fourier transform of $\hat q \in L^1(\mathbb{R}^d)$, which is real valued and strictly positive.
2. The density $p \in L^1(\mathbb{R}^d) \cap L^\infty(\mathbb{R}^d)$ is finite and nonnegative everywhere.

Under Assumption 1, $k$ is a positive definite kernel by Bochner's theorem and we have an explicit characterization of the associated RKHS (see e.g.
[35, Proposition 4]):

$H = \left\{ f \in C(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)\ ;\ \int_{\mathbb{R}^d} \frac{|\hat f(\omega)|^2}{\hat q(\omega)}\, d\omega < +\infty \right\}, \qquad (3)$

with inner product

$\langle \cdot, \cdot \rangle_H : (f, g) \mapsto \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \frac{\hat f(\omega)\, \overline{\hat g(\omega)}}{\hat q(\omega)}\, d\omega. \qquad (4)$

Remark 1 The assumption $\hat q \in L^1(\mathbb{R}^d)$ implies, by the Riemann-Lebesgue theorem, that $q$ belongs to $C_0(\mathbb{R}^d)$, the set of continuous functions vanishing at infinity. Since $\hat q$ is strictly positive, its support is $\mathbb{R}^d$, and [31, Proposition 8] implies that $k$ is $c_0$-universal, i.e., that $H$ is dense in $C_0(\mathbb{R}^d)$ w.r.t. the uniform norm. As a result, $H$ is also dense in $L^2(d\rho)$ for any probability measure $\rho$.

Remark 2 For any $f \in H$, we have by the Cauchy-Schwarz inequality

$\left( \int_{\mathbb{R}^d} |\hat f(\omega)|\, d\omega \right)^2 \le \int_{\mathbb{R}^d} \frac{|\hat f(\omega)|^2}{\hat q(\omega)}\, d\omega\ \int_{\mathbb{R}^d} \hat q(\omega)\, d\omega,$

and the last term is finite by Assumption 1. Hence $\hat f \in L^1(\mathbb{R}^d)$ and $f(0) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat f(\omega)\, d\omega$, where the integral is understood in the usual sense. In this setting, any $f \in H$ is uniquely determined everywhere on $\mathbb{R}^d$ by its Fourier transform, and for any $f \in H$ we have $\|f\|_{L^\infty(\mathbb{R}^d)} \le (2\pi)^{-d} \|\hat f\|_{L^1(\mathbb{R}^d)} \le \|f\|_H \sqrt{q(0)}$.

3.2 Main result

Problem (1) is related to a simpler variational problem with an explicit solution. For any $\lambda > 0$, let

$D(\lambda) := \min_{f \in H}\ \int_{\mathbb{R}^d} f(x)^2\, dx + \lambda \|f\|_H^2 \quad \text{subject to} \quad f(0) = 1. \qquad (5)$

Note that $D(\cdot)$ does not depend on $p$ and corresponds to the Christoffel function at the origin $0$ (or at any other point, by translation invariance) for the Lebesgue measure on $\mathbb{R}^d$. The solutions of (5) have an explicit description, whose proof is presented in Appendix B.2.

Lemma 1 For any $\lambda > 0$,

$D(\lambda) = \left( \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \frac{\hat q(\omega)}{\lambda + \hat q(\omega)}\, d\omega \right)^{-1},$

and this value is attained by the function

$f_\lambda : x \mapsto \frac{D(\lambda)}{(2\pi)^d} \int_{\mathbb{R}^d} \frac{\hat q(\omega)\, e^{i \omega^\top x}}{\hat q(\omega) + \lambda}\, d\omega.$

Remark 3 We directly obtain $D(\lambda) \ge (2\pi)^d \lambda / \int_{\mathbb{R}^d} \hat q(\omega)\, d\omega = \lambda / q(0)$ for any $\lambda > 0$. Finally, let us mention that Assumption 1 ensures that $\lim_{\lambda \to 0} D(\lambda) = 0$, as $\int_{\mathbb{R}^d} \frac{\hat q(\omega)}{\lambda + \hat q(\omega)}\, d\omega \ge \frac{1}{2} \int_{\hat q(\omega) \ge \lambda} d\omega$, which diverges as $\lambda \to 0$.

We denote by $g_\lambda$ the inverse Fourier transform of $\frac{\hat q}{\lambda + \hat q}$, i.e., $g_\lambda = f_\lambda / D(\lambda)$. It satisfies $g_\lambda(0) = 1/D(\lambda)$. Intuitively, as $\lambda$ tends to 0, $g_\lambda$ should approach a Dirac, in the sense that $g_\lambda$ tends to 0 everywhere except at the origin, where it goes to $+\infty$. The purpose of the next assumption is to quantify this intuition.

Assumption 2 For the kernel $k$ given in Assumption 1 and $f_\lambda$ given in Lemma 1, there exists $\varepsilon : \mathbb{R}_+ \to \mathbb{R}_+$ such that, as $\lambda \to 0$, $\varepsilon(\lambda) \to 0$ and

$\int_{\|x\| \ge \varepsilon(\lambda)} f_\lambda^2(x)\, dx = o(\lambda D(\lambda)).$

See Section 3.3 for specific examples. We are now ready to describe the asymptotics inside the support of $p$; the proof is given in Appendix B.1.

Theorem 1 Let $q$, $k$ and $p$ be given as in Assumption 1 and let $C_\lambda$ be defined as in (1). If Assumption 2 holds, then, for any $z \in \mathbb{R}^d$ such that $p(z) > 0$ and $p$ is continuous at $z$, we have

$C_\lambda(z) \underset{\lambda \to 0,\ \lambda > 0}{\sim} p(z)\, D\!\left( \frac{\lambda}{p(z)} \right).$

Proof sketch. The equivalent is shown using the variational formulation in Eq. (1). A natural candidate for the optimal function $f$ is the optimizer obtained from the Lebesgue measure in Eq. (5), scaled by $p(z)$. Together with Assumption 2, this leads to the desired upper bound. To obtain the corresponding lower bound, we consider the Lebesgue measure restricted to a small ball around $z$. Using linear algebra and expansions of operator inverses, we relate the optimal value directly to the optimal value $D(\lambda)$ of Eq. (5).

This result is complemented by the following theorem, which describes the asymptotic behavior outside the support of $p$; the proof is given in Appendix B.3.

Theorem 2 Let $q$, $k$ and $p$ be given as in Theorem 1. Then, for any $z \in \mathbb{R}^d$ such that there exists $\epsilon > 0$ with $\int_{\|z - x\| \le \epsilon} p(x)\, dx = 0$, we have

(i) $C_\lambda(z) \underset{\lambda \to 0,\ \lambda > 0}{=} O\!\left( \sqrt{\lambda}\, D(\sqrt{\lambda}) \right).$

If furthermore there exist $a \ge 0$ and $c > 0$ such that, for any $\omega \in \mathbb{R}^d$, $\hat q(\omega) \ge \frac{c}{1 + \|\omega\|^a}$, then, for any such $z \in \mathbb{R}^d$, we have

(ii) $C_\lambda(z) \underset{\lambda \to 0,\ \lambda > 0}{=} O(\lambda).$

Proof sketch. Since only an upper bound is needed, we simply have to propose a candidate function for $f$, and we build one from the solution of Eq.
(5) for (i), and directly from properties of kernels for (ii).

Remark 4 Theorems 1 and 2 underline a separation between the "inside" and the "outside" of the support of $p$ and describe the fact that the convergence to 0 as $\lambda$ decreases is faster outside. (i) If $\log(D(\lambda)) = \alpha \log(\lambda) + o(1)$ with $\alpha < 1$ (which is the case in most interesting situations), then $C_\lambda(z) = O(\sqrt{\lambda}\, D(\sqrt{\lambda})) = o(D(\lambda))$. (ii) It holds that $\lambda = o(D(\lambda))$. Hence, in most cases, the values of the Christoffel function outside the support of $p$ are negligible compared to the ones inside the support of $p$.

Combining Theorems 1 and 2 does not describe what happens in the limit case where neither of the conditions on $z$ holds, for example on the boundary of the support or at discontinuity points of the density. We expect this to depend strongly on the geometry of $p$ and its support. In the polynomial case on the simplex, the rate depends on the dimension of the largest face containing the point of interest [38]. Settling this question in the RKHS setting is left for future research.

3.3 A general construction

We describe a class of kernels for which Assumptions 1 and 2 hold and Theorem 1 can be applied, which includes Sobolev spaces. We also compute explicit equivalents for $D(\cdot)$ in (5). We first introduce a definition and an assumption.

Definition 2 For any $s \in \mathbb{N}^*$, a $d$-variate polynomial $P$ of degree $2s$ is called $2s$-positive if it satisfies the following.
• Let $Q$ denote the $2s$-homogeneous part of $P$ (the sum of its monomials of degree $2s$). $Q$ is (strictly) positive on the unit sphere in $\mathbb{R}^d$.
• The polynomial $R = P - Q$ satisfies $R(x) \ge 1$ for all $x \in \mathbb{R}^d$.

Remark 5 If $P$ is $2s$-positive, then it is always greater than 1 and its $2s$-homogeneous part is strictly positive except at the origin. The positivity of $Q$ forbids the use of polynomials $P$ of the form $\prod_{i=1}^d (1 + \omega_i^2)$, which would allow treating product kernels. Indeed, this would lead to $Q(\omega) = \prod_{i=1}^d \omega_i^2$, which is not positive on the unit sphere. The last condition on $R$ is not very restrictive, as it can be ensured by a proper rescaling of $P$ if we only have $R > 0$.

Assumption 3 Let $P$ be a $2s$-positive $d$-variate polynomial and let $\gamma \ge 1$ be such that $2s\gamma > d$. The kernel $k$ is given as in Assumption 1 with $\hat q = \frac{1}{P^\gamma}$.

One can check that $q$ in Assumption 3 is well defined and satisfies Assumption 1. A famous example of such a kernel is the Laplace kernel $(x, y) \mapsto e^{-\|x - y\|}$, which amounts, up to rescaling, to choosing $P$ of the form $1 + a \|\cdot\|^2$ for $a > 0$ and $\gamma = \frac{d+1}{2}$. In addition, Assumption 3 allows capturing the usual multi-dimensional Sobolev space of functions with square-integrable partial derivatives up to order $s$, with $s > d/2$, and the corresponding norm. We now provide the main result of this section.

Lemma 2 Assume that $p$ and $k$ are given as in Assumptions 1 and 3. Then Assumption 2 is satisfied. More precisely, set $q_0 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \frac{1}{1 + Q(\omega)^\gamma}\, d\omega$ and $p = \lceil s\gamma \rceil$; then for any $l < \left(1 - \frac{d}{2s\gamma}\right)/(8p)$ the following holds true as $\lambda \to 0$, $\lambda > 0$:

(i) $D(\lambda) \sim \frac{\lambda^{d/(2s\gamma)}}{q_0}$,

(ii) $\int_{\|x\| \ge \lambda^l} f_\lambda^2(x)\, dx = o(\lambda D(\lambda))$.

Remark 6 If $Q : \omega \mapsto \|\omega\|^{2s}$, using spherical coordinate integration, we obtain

$q_0 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \frac{1}{1 + Q(\omega)^\gamma}\, d\omega = \frac{1}{2^{d-1} \pi^{d/2}\, \Gamma\!\left( \frac{d}{2} \right)} \int_0^{+\infty} \frac{r^{d-1}}{1 + r^{2s\gamma}}\, dr = \frac{1}{2^{d-1} \pi^{d/2}\, \Gamma\!\left( \frac{d}{2} \right)} \cdot \frac{\pi}{2s\gamma \sin\!\left( \frac{d\pi}{2s\gamma} \right)}.$

The proof is presented in Appendix B.4. We have the following corollary, a direct application of Theorem 1, which makes the asymptotics of the Christoffel function explicit in terms of the density $p$.

Corollary 1 Assume that $p$ and $k$ are given as in Assumptions 1 and 3 and that $z \in \mathbb{R}^d$ is such that $p(z) > 0$ and $p$ is continuous at $z$. Then, as $\lambda \to 0$, $\lambda > 0$,

$C_\lambda(z) \sim \frac{\lambda^{d/(2s\gamma)}}{q_0}\, p(z)^{1 - \frac{d}{2s\gamma}}.$

4 Numerical illustration

In this section we provide numerical evidence confirming the rate described in Corollary 1. We use the Matérn kernel, a parametric radial kernel allowing different values of $\gamma$ in Assumption 3.

4.1 Matérn kernel

We follow the description of [27, Section 4.2.1]; note that the Fourier transform is normalized differently in our paper.
For any $\nu > 0$ and $l > 0$, we let, for any $x \in \mathbb{R}^d$,

$q_{\nu,l}(x) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \|x\|}{l} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, \|x\|}{l} \right), \qquad (6)$

where $K_\nu$ is the modified Bessel function of the second kind [1, Section 9.6]. This choice of $q$ satisfies Assumption 3, with $s = 1$ and $\gamma = \nu + \frac{d}{2}$. Indeed, for any $\nu, l > 0$, its Fourier transform is given, for any $\omega \in \mathbb{R}^d$, by

$\hat q_{\nu,l}(\omega) = \frac{2^d \pi^{d/2}\, \Gamma(\nu + d/2)\, (2\nu)^{\nu}}{\Gamma(\nu)\, l^{2\nu}} \cdot \frac{1}{\left( \frac{2\nu}{l^2} + \|\omega\|^2 \right)^{\nu + \frac{d}{2}}}. \qquad (7)$

4.2 Empirical validation of the convergence rate estimate

Corollary 1 ensures that, given $\nu, l > 0$ and $q$ as in (6), as $\lambda \to 0$ we have, for appropriate $z$, $C_\lambda(z) \sim \lambda^{\frac{d}{2\nu + d}}\, p(z)^{\frac{2\nu}{2\nu + d}} / q_0(\nu, l)$. We use the Riemann sum plug-in approximation described in Section 2.4 to illustrate this result numerically. We perform extensive investigations with a compactly supported sinusoidal density in dimension 1. Note that from Remark 6 we have the closed-form expression

$q_0(\nu, l) = \left( \frac{2^d \pi^{d/2}\, \Gamma(\nu + d/2)\, (2\nu)^{\nu}}{\Gamma(\nu)\, l^{2\nu}} \right)^{\frac{1}{2\nu + d}} \frac{1}{(2\nu + d) \sin\!\left( \frac{d\pi}{2\nu + d} \right)}.$

Figure 2: The target density $p$ is represented in red. We consider different choices of $\nu$ and $l$ for $q$ as in (6). We use the Riemann sum plug-in approximation described in (2) with $n = 2000$. Left: the fact that the estimate is close to the density is clear for small values of $\lambda$. Right: the dotted line represents the identity.
This suggests that the rate estimate is of the correct order in $\lambda$.

Relation with the density: For a given choice of $\nu, l > 0$, as $\lambda \to 0$, we should obtain, for appropriate $z$, that the quantity $\left( \frac{C_\lambda(z)\, q_0(\nu, l)}{\lambda^{d/(d + 2\nu)}} \right)^{1 + d/(2\nu)}$ is roughly equal to $p(z)$. This is confirmed numerically, as presented in Figure 2 (left), for different choices of the parameter $\nu$.

Convergence rate: For a given choice of $\nu, l > 0$, as $\lambda \to 0$, we should obtain, for appropriate $z$, that the quantity $\left( \frac{C_\lambda(z)\, q_0(\nu, l)}{p(z)^{2\nu/(2\nu + d)}} \right)^{\frac{2\nu + d}{d}}$ is roughly equal to $\lambda$. The same experiment confirms this finding, as presented in Figure 2 (right), which suggests that the exponent in $\lambda$ is of the correct order.

Additional experiments: A piecewise-constant density is considered in Appendix A, which also contains simulations suggesting that the asymptotics are of a different nature for the Gaussian kernel, for which we conjecture that our result does not hold.

5 Conclusion and future work

We have introduced a notion of Christoffel function in RKHS settings. This allowed us to derive precise asymptotic expansions for a quantity known as the statistical leverage score, which has a wide variety of applications in machine learning with kernel methods. Our main result states that the leverage score is inversely proportional to a power of the population density at the considered point. This has an intuitive meaning, as the leverage score measures the contribution of a given observation to a statistical estimate: in a densely populated region, a specific observation, which should have many close neighbors, has less effect on a statistical estimate than observations in less populated areas of space. Our observation gives a precise meaning to this statement and sheds new light on the relevance of the notion of leverage score.
Furthermore, it is coherent with known results in the orthogonal polynomial literature, from which the notion of Christoffel function was inspired.

Direct extensions of this work include approximation bounds for our proposed plug-in estimate and the tuning of the regularization parameter $\lambda$. A related question is the relevance of the proposed variational formulation for the statistical estimation of leverage scores when learning from random features, in particular random Fourier features, and for density/support estimation. Another line of future research is the extension of our estimates to broader classes of RKHS, for example kernels with product structure, such as the $\ell_1$ counterpart of the Laplace kernel. Finally, it would be interesting to extend the concepts to abstract topological spaces beyond $\mathbb{R}^d$.

Acknowledgements

We acknowledge support from the European Research Council (grant SEQUOIA 724063).

References

[1] M. Abramowitz and I. Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables. Courier Corporation, 1964.

[2] A. Alaoui and M. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775-783, 2015.

[3] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.

[4] A. Askari, F. Yang, and L. El Ghaoui. Kernel-based outlier detection using the inverse Christoffel function. Technical report, 2018. arXiv preprint arXiv:1806.06775.

[5] W. Van Assche. Asymptotics for orthogonal polynomials.