{"title": "Error Analysis of Generalized Nystr\u00f6m Kernel Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2541, "page_last": 2549, "abstract": "Nystr\\\"{o}m method has been used successfully to improve the computational efficiency of kernel ridge regression (KRR). Recently, theoretical analysis of Nystr\\\"{o}m KRR, including generalization bound and convergence rate, has been established based on reproducing kernel Hilbert space (RKHS) associated with the symmetric positive semi-definite kernel. However, in real world applications, RKHS is not always optimal and kernel function is not necessary to be symmetric or positive semi-definite. In this paper, we consider the generalized Nystr\\\"{o}m kernel regression (GNKR) with $\\ell_2$ coefficient regularization, where the kernel just requires the continuity and boundedness. Error analysis is provided to characterize its generalization performance and the column norm sampling is introduced to construct the refined hypothesis space. In particular, the fast learning rate with polynomial decay is reached for the GNKR. Experimental analysis demonstrates the satisfactory performance of GNKR with the column norm sampling.", "full_text": "Error Analysis of Generalized Nystr\u00f6m Kernel\n\nRegression\n\nHong Chen\n\nComputer Science and Engineering\n\nUniversity of Texas at Arlington\n\nArlington, TX, 76019\n\nchenh@mail.hzau.edu.cn\n\nHaifeng Xia\n\nMathematics and Statistics\n\nHuazhong Agricultural University\n\nWuhan 430070,China\n\nhaifeng.xia0910@gmail.com\n\nWeidong Cai\n\nSchool of Information Technologies\n\nUniversity of Sydney\nNSW 2006, Australia\n\ntom.cai@sydney.edu.au\n\nHeng Huang\n\nComputer Science and Engineering\n\nUniversity of Texas at Arlington\n\nArlington, TX, 76019\n\nheng@uta.edu\n\nAbstract\n\nNystr\u00f6m method has been successfully used to improve the computational ef\ufb01-\nciency of kernel ridge regression (KRR). 
Recently, theoretical analysis of Nystr\u00f6m KRR, including generalization bounds and convergence rates, has been established based on the reproducing kernel Hilbert space (RKHS) associated with a symmetric positive semi-definite kernel. However, in real-world applications, an RKHS is not always optimal, and the kernel function need not be symmetric or positive semi-definite. In this paper, we consider the generalized Nystr\u00f6m kernel regression (GNKR) with $\\ell_2$ coefficient regularization, where the kernel is only required to be continuous and bounded. Error analysis is provided to characterize its generalization performance, and the column norm sampling strategy is introduced to construct the refined hypothesis space. In particular, a fast learning rate with polynomial decay is reached for GNKR. Experimental analysis demonstrates the satisfactory performance of GNKR with column norm sampling.\n\n1 Introduction\n\nHigh computational complexity makes kernel methods infeasible for large-scale data. Recently, the Nystr\u00f6m method and its alternatives (e.g., the random Fourier feature technique [15] and the sketching method [25]) have been used to scale up kernel ridge regression (KRR) [4, 23, 27]. The key step of the Nystr\u00f6m method is to construct a subsampled matrix that contains only some of the columns of the original empirical kernel matrix. Therefore, the criterion used to sample the matrix columns heavily affects the learning performance. The subsampling strategies of the Nystr\u00f6m method can be categorized into two types: uniform sampling and non-uniform sampling. 
Uniform sampling is the simplest strategy and has shown satisfactory performance in some applications [16, 23, 24]. From different theoretical perspectives, several non-uniform sampling approaches have been proposed, such as squared $\\ell_2$ column-norm sampling [3, 4], leverage score sampling [5, 8, 12], and adaptive sampling [11]. Besides the sampling strategies, learning bounds for Nystr\u00f6m kernel regression have been established with respect to three measures: the matrix approximation [4, 5, 11], the coefficient approximation [9, 10], and the excess generalization error [2, 16, 24].\n\nDespite rapid progress in theory and applications, the following critical issues should be further addressed for Nystr\u00f6m kernel regression.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\u2022 Nystr\u00f6m regression with general kernel. Previous algorithms are mainly limited to KRR with symmetric and positive semi-definite kernels. For real-world applications, this restriction may not be necessary. Several general kernels have shown competitive performance in machine learning, e.g., indefinite kernels for regularized algorithms [14, 20, 26] and PCA [13]. Therefore, it is important to formulate a learning algorithm for Generalized Nystr\u00f6m Kernel Regression (GNKR).\n\u2022 Generalization analysis and sampling criterion. Previous theoretical results rely on the symmetric positive semi-definite (SPSD) matrix associated with a Mercer kernel [17]. However, this condition is not satisfied for GNKR, which makes the error analysis more difficult. Can we obtain a generalization error analysis for GNKR? It is also interesting to explore sampling strategies for GNKR, e.g., the column-norm sampling in [3, 4].\n\nTo address the above issues, we propose the GNKR algorithm and investigate its theoretical properties in terms of generalization bound and learning rate. 
Inspired by recent studies on data dependent hypothesis spaces [7, 19], we establish the error analysis for GNKR, which implies that a learning rate with polynomial decay can be reached under proper parameter selection. Meanwhile, we extend the $\\ell_2$ column norm subsampling in linear regression [16, 22] to the GNKR setting.\n\nThe main contributions of this paper can be summarized as follows:\n\n\u2022 GNKR with $\\ell_2$ regularization. Due to the lack of a Mercer condition for general kernels, coefficient regularization becomes a natural replacement for the kernel norm regularization in KRR. Note that the Nystr\u00f6m approximation plays a role similar to the $\\ell_1$ regularization in [7, 18, 20], which addresses the sample sparsity of the hypothesis function. Hence, we formulate GNKR by combining the Nystr\u00f6m method and the least squares regression with $\\ell_2$ regularization in [19, 21].\n\u2022 Theoretical and empirical evaluations. From the viewpoint of learning with data dependent hypothesis spaces, theoretical analysis of GNKR is established to illustrate its generalization bound and learning rate. In particular, a fast learning rate arbitrarily close to $O(m^{-1})$ is obtained under mild conditions, where $m$ is the size of the subsampled set. The effectiveness of GNKR is also supported by experiments on synthetic and real-world data sets.\n\n2 Related Works\n\nDue to their flexibility and adaptivity, least squares regression algorithms with general kernels have been proposed with various types of regularization, e.g., the $\\ell_1$-regularizer [18, 21], the $\\ell_2$-regularizer [19, 20], and the elastic net regularization [7]. For the Mercer kernel, these algorithms are closely related to KRR, which has been well understood in learning theory. 
For the general kernel setting, theoretical foundations of regression with coefficient regularization have been studied recently via analysis techniques based on operator approximation [20] and empirical covering numbers [7, 18, 19]. Despite rich theoretical results, the previous works mainly focus on the prediction accuracy without considering the computational complexity for large-scale data.\n\nNystr\u00f6m approximation has been studied extensively for kernel methods recently. Almost all existing studies rely on the fast approximation of the SPSD matrix associated with a Mercer kernel. For the fixed design setting, the expectation of the excess generalization error is bounded for least squares regression with the regularizer in RKHS [1, 2]. Recently, probabilistic error bounds have been estimated for Nystr\u00f6m KRR in [16, 24]. In [24], a fast learning rate of $O(m^{-1})$ is derived for fixed design regression under conditions on the kernel matrix eigenvalues. In [16], the convergence rate is obtained under the capacity assumption and the regularity assumption. It is worth noting that the learning bound in [16] is based on estimates of the sample error, the computation error, and the approximation error. Indeed, the computation error is related to the sampling subset and can be considered as the hypothesis error in [18], which is induced by the variance of hypothesis spaces. Unlike previous works, our theoretical analysis of GNKR depends on a general continuous kernel and $\\ell_2$ coefficient regularization.\n\n3 Generalized Nystr\u00f6m Kernel Regression\n\nLet $\\rho$ be a probability distribution on $\\mathcal{Z} := \\mathcal{X} \\times \\mathcal{Y}$, where $\\mathcal{X} \\subset \\mathbb{R}^d$ and $\\mathcal{Y} \\subset \\mathbb{R}$ are viewed as the input space and the output space, respectively. 
Let $\\rho(\\cdot|x)$ be the conditional distribution of $\\rho$ for given $x \\in \\mathcal{X}$ and let $\\mathcal{F}$ be a measurable function space on $\\mathcal{X}$. In statistical learning, the samples $z := \\{z_i\\}_{i=1}^n = \\{(x_i, y_i)\\}_{i=1}^n$ are drawn independently and identically from an unknown distribution $\\rho$. The task of least squares regression is to find a prediction function $f : \\mathcal{X} \\to \\mathbb{R}$ such that the expected risk\n\n$$\\mathcal{E}(f) = \\int_{\\mathcal{Z}} (y - f(x))^2 \\, d\\rho(x, y)$$\n\nis as small as possible. From the viewpoint of approximation theory, this means searching for a good approximation of the regression function\n\n$$f_\\rho(x) = \\int_{\\mathcal{Y}} y \\, d\\rho(y|x)$$\n\nbased on the empirical risk\n\n$$\\mathcal{E}_z(f) = \\frac{1}{n} \\sum_{i=1}^n (y_i - f(x_i))^2.$$\n\nLet $K : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ be a continuous and bounded kernel function. Without loss of generality, we assume that $\\kappa := \\sup_{x, x' \\in \\mathcal{X}} K(x, x') \\le 1$ and $|y| \\le 1$ for all $y \\in \\mathcal{Y}$ throughout this paper.\n\nBesides the given samples $z$, the hypothesis function space is crucial for good learning performance. 
The following data dependent hypothesis space has been used for kernel regression with coefficient regularization:\n\n$$\\mathcal{H}_n = \\Big\\{ f(x) = \\sum_{i=1}^n \\tilde\\alpha_i K(x_i, x) : \\tilde\\alpha = (\\tilde\\alpha_1, ..., \\tilde\\alpha_n) \\in \\mathbb{R}^n, x \\in \\mathcal{X} \\Big\\}.$$\n\nGiven $z$, kernel regression with $\\ell_2$ regularization [19, 20] is formulated as\n\n$$\\tilde f_z = f_{\\tilde\\alpha_z} = \\sum_{i=1}^n \\tilde\\alpha_{z,i} K(x_i, \\cdot) \\qquad (1)$$\n\nwith\n\n$$\\tilde\\alpha_z = \\arg\\min_{\\tilde\\alpha \\in \\mathbb{R}^n} \\Big\\{ \\frac{1}{n} \\|K_{nn} \\tilde\\alpha - Y\\|_2^2 + \\lambda n \\tilde\\alpha^T \\tilde\\alpha \\Big\\},$$\n\nwhere $K_{nn} = (K(x_i, x_j))_{i,j=1}^n$, $Y = (y_1, \\cdots, y_n)^T$, and $\\lambda > 0$ is a regularization parameter. Even though positive semi-definiteness is not required for the kernel, (3) can also be solved via the following linear system (see Theorem 3.1 in [20]):\n\n$$(K_{nn}^T K_{nn} + \\lambda n^2 I_n) \\tilde\\alpha = K_{nn}^T Y, \\qquad (2)$$\n\nwhere $I_n$ is the identity matrix of order $n$. From the viewpoint of learning a function in $\\mathcal{H}_n$, (1) can be rewritten as\n\n$$\\tilde f_z = \\arg\\min_{f \\in \\mathcal{H}_n} \\big\\{ \\mathcal{E}_z(f) + \\lambda n \\|f\\|_{\\ell_2}^2 \\big\\}, \\qquad (3)$$\n\nwhere\n\n$$\\|f\\|_{\\ell_2}^2 = \\inf \\Big\\{ \\sum_{i=1}^n \\tilde\\alpha_i^2 : f = \\sum_{i=1}^n \\tilde\\alpha_i K(x_i, \\cdot) \\Big\\}.$$\n\nIn a standard implementation of (2), the computational complexity is $O(n^3)$. This computational requirement becomes the bottleneck of (3) when facing large data sets. 
To reduce the computational burden, we seek the predictor in a smaller hypothesis space\n\n$$\\mathcal{H}_m = \\Big\\{ f(x) = \\sum_{i=1}^m \\alpha_i K(\\bar x_i, x) : \\alpha = (\\alpha_1, ..., \\alpha_m) \\in \\mathbb{R}^m, x \\in \\mathcal{X}, \\{\\bar x_i\\}_{i=1}^m \\subset \\{x_i\\}_{i=1}^n \\Big\\}.$$\n\nThe generalized Nystr\u00f6m kernel regression (GNKR) can be formulated as\n\n$$f_z = \\arg\\min_{f \\in \\mathcal{H}_m} \\big\\{ \\mathcal{E}_z(f) + \\lambda m \\|f\\|_{\\ell_2}^2 \\big\\}. \\qquad (4)$$\n\nDenote $(K_{nm})_{ij} = K(x_i, \\bar x_j)$ and $(K_{mm})_{jk} = K(\\bar x_j, \\bar x_k)$ for $i \\in \\{1, ..., n\\}$, $j, k \\in \\{1, ..., m\\}$. We can deduce that\n\n$$f_z = \\sum_{i=1}^m \\alpha_{z,i} K(\\bar x_i, \\cdot)$$\n\nwith\n\n$$(K_{nm}^T K_{nm} + \\lambda m n I_m) \\alpha_z = K_{nm}^T Y. \\qquad (5)$$\n\nThe key problem of (4) is how to select the subset $\\{\\bar x_i\\}_{i=1}^m$ so that the computational complexity is reduced efficiently while satisfactory accuracy is still guaranteed. For KRR, there are several strategies for selecting the subset with different motivations [5, 11, 12]. In this paper we preliminarily consider the following two strategies with low computational complexity:\n\n\u2022 Uniform Subsampling. The subset $\\{\\bar x_i\\}_{i=1}^m$ is drawn uniformly at random from the input $\\{x_i\\}_{i=1}^n$.\n\u2022 Column-norm Subsampling. The subset $\\{\\bar x_i\\}_{i=1}^m$ is drawn from $\\{x_i\\}_{i=1}^n$ independently with probabilities $p_i = \\|K_i\\|_2 / \\sum_{j=1}^n \\|K_j\\|_2$, where $K_i = (K(x_1, x_i), ..., K(x_n, x_i))^T \\in \\mathbb{R}^n$.\n\nSome discussions of the column-norm subsampling will be provided in Section 4.\n\n4 Learning Theory Analysis\n\nIn this section, we introduce our theoretical results on the generalization bound and learning rate. The detailed proofs can be found in the supplementary materials.\n\nInspired by the analysis techniques in [7, 19], we first introduce the intermediate function for error decomposition. 
Let $\\mathcal{F}$ be the square integrable space on $\\mathcal{X}$ with norm $\\|\\cdot\\|_{L^2_{\\rho_X}}$. For any bounded continuous kernel $K : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$, the integral operator $L_K : \\mathcal{F} \\to \\mathcal{F}$ is defined as\n\n$$L_K f(x) = \\int_{\\mathcal{X}} K(x, t) f(t) \\, d\\rho_X(t), \\quad \\forall x \\in \\mathcal{X},$$\n\nwhere $\\rho_X$ is the marginal distribution of $\\rho$. Given $\\mathcal{F}$ and $L_K$, introduce the function space\n\n$$\\mathcal{H} = \\{ g = L_K f, f \\in \\mathcal{F} \\} \\quad \\text{with} \\quad \\|g\\|_{\\mathcal{H}} = \\inf \\{ \\|f\\|_{L^2_{\\rho_X}} : g = L_K f \\}.$$\n\nSince $\\mathcal{H}$ is sample independent, the intermediate function can be constructed as $g_\\lambda = L_K f_\\lambda$, where\n\n$$f_\\lambda = \\arg\\min_{f \\in \\mathcal{F}} \\big\\{ \\mathcal{E}(L_K f) - \\mathcal{E}(f_\\rho) + \\lambda \\|f\\|_{L^2_{\\rho_X}}^2 \\big\\}. \\qquad (6)$$\n\nIn learning theory, $g_\\lambda$ is usually called the regularized function and\n\n$$D(\\lambda) = \\inf_{g \\in \\mathcal{H}} \\{ \\mathcal{E}(g) - \\mathcal{E}(f_\\rho) + \\lambda \\|g\\|_{\\mathcal{H}}^2 \\} = \\mathcal{E}(L_K f_\\lambda) - \\mathcal{E}(f_\\rho) + \\lambda \\|f_\\lambda\\|_{L^2_{\\rho_X}}^2$$\n\nis called the approximation error.\n\nTo further bridge the gap between $g_\\lambda$ and $f_z$, we construct the stepping stone function\n\n$$\\hat g_\\lambda = \\frac{1}{m} \\sum_{i=1}^m f_\\lambda(\\bar x_i) K(\\bar x_i, \\cdot). \\qquad (7)$$\n\nThe following condition on $K$ is used in this paper and has been well studied in the learning theory literature [18, 19]. Examples include the Gaussian kernel, the sigmoid kernel [17], and the fractional power polynomials [13].\n\nDefinition 1 The kernel function $K$ is a $C^s$ kernel with $s > 0$ if there exists some constant $c_s > 0$ such that\n\n$$|K(t, x) - K(t, x')| \\le c_s \\|x - x'\\|_2^s, \\quad \\forall t, x, x' \\in \\mathcal{X}.$$\n\nThe definition of $f_\\rho$ tells us $|f_\\rho(x)| \\le 1$, so it is natural to restrict the predictor to $[-1, 1]$. 
The projection operator\n\n$$\\pi(f)(x) = \\min\\{1, f(x)\\} I\\{f(x) \\ge 0\\} + \\max\\{-1, f(x)\\} I\\{f(x) < 0\\}$$\n\nhas been used extensively in learning theory analysis, e.g., [6].\n\nWe are now in a position to present our result on the generalization error bound.\n\nTheorem 1 Suppose that $\\mathcal{X}$ is a compact subset of $\\mathbb{R}^d$ and $K \\in C^s(\\mathcal{X} \\times \\mathcal{X})$ for some $s > 0$. For any $0 < \\delta < 1$, with confidence $1 - \\delta$, there holds\n\n$$\\mathcal{E}(\\pi(f_z)) - \\mathcal{E}(f_\\rho) \\le \\tilde c_1 \\log^2(8/\\delta) \\Big( (1 + m^{-1}\\lambda^{-1} + m^{-2}\\lambda^{-2} + n^{-\\frac{2}{2+p}} \\lambda^{-2}) D(\\lambda) + n^{-\\frac{2}{2+p}} \\lambda^{-\\frac{p}{2+p}} \\Big),$$\n\nwhere the constant $\\tilde c_1$ is independent of $m, n, \\delta$, and\n\n$$p = 2d/(d + 2s) \\text{ if } 0 < s \\le 1; \\quad p = 2d/(d + 2) \\text{ if } 1 < s \\le 1 + d/2; \\quad p = d/s \\text{ if } s > 1 + d/2. \\qquad (8)$$\n\nTheorem 1 is a general result that applies to Lipschitz continuous kernels. Although the statement appears somewhat complicated at first sight, it yields a fast convergence rate for the error when specialized to particular kernels. Before doing so, let us provide a few heuristic arguments for intuition. Theorem 1 guarantees an upper bound of the form\n\n$$\\|\\pi(f_z) - f_\\rho\\|_{L^2_{\\rho_X}}^2 \\le O\\Big( c(m, n, \\lambda) \\inf_f \\{ \\mathcal{E}(L_K f) - \\mathcal{E}(f_\\rho) + \\lambda \\|f\\|_{L^2_{\\rho_X}}^2 \\} + n^{-\\frac{2}{2+p}} \\lambda^{-\\frac{p}{2+p}} \\Big). \\qquad (9)$$\n\nNote that a smaller value of $\\lambda$ reduces the approximation error term but increases the second term associated with the sample error. This inequality demonstrates that a proper $\\lambda$ should be selected to balance the two terms. 
This quantitative relationship (9) can also be regarded as an oracle inequality for GNKR, where the approximation error $D(\\lambda)$ can only be obtained by an oracle knowing the distribution.\n\nTheorem 1 tells us that the generalization bound of GNKR depends on the sample sizes $m, n$, the degree of continuity $s$, and the approximation error $D(\\lambda)$. In essence, the subsampling number $m$ has a double impact on the generalization error: one is through the complexity of the data dependent hypothesis space $\\mathcal{H}_m$ and the other is through the selection of the parameter $\\lambda$.\n\nNow we introduce the characterization of the approximation error, which has been studied in [19, 20].\n\nDefinition 2 The target function $f_\\rho$ can be approximated with exponent $0 < \\beta \\le 1$ in $\\mathcal{H}$ if there exists a constant $c_\\beta \\ge 1$ such that $D(\\lambda) \\le c_\\beta \\lambda^\\beta$ for any $\\lambda > 0$.\n\nIf the kernel is not symmetric or positive semi-definite, the approximation condition holds true for $\\beta = \\frac{2r}{3}$ when $L_{\\tilde K}^{-r} f_\\rho \\in L^2_{\\rho_X}$, where $L_{\\tilde K}$ is the integral operator associated with $\\tilde K(u, v) = \\int_{\\mathcal{X}} K(u, x) K(v, x) \\, d\\rho_X$, $(u, v) \\in \\mathcal{X}^2$ (see [7]).\n\nNow we state our main results on the convergence rate.\n\nTheorem 2 Let $\\mathcal{X}$ be a compact subset of $\\mathbb{R}^d$. Assume that $f_\\rho$ can be approximated with exponent $0 < \\beta \\le 1$ in $\\mathcal{H}$ and $K \\in C^s(\\mathcal{X} \\times \\mathcal{X})$ for some $s > 0$. Choose $m \\le n^{\\frac{1}{2+p}}$ and $\\lambda = m^{-\\theta}$ for some $\\theta > 0$. For any $0 < \\delta < 1$, with confidence $1 - \\delta$, there holds\n\n$$\\mathcal{E}(\\pi(f_z)) - \\mathcal{E}(f_\\rho) \\le \\tilde c_2 \\log^2(8/\\delta) m^{-\\gamma},$$\n\nwhere the constant $\\tilde c_2$ is independent of $m, \\delta$, and\n\n$$\\gamma = \\min\\Big\\{ 2 - \\frac{p\\theta}{2+p}, 2 + \\beta\\theta - 2\\theta, \\beta\\theta, 1 + \\beta\\theta - \\theta \\Big\\}.$$\n\nTheorem 2 states the polynomial convergence rate of GNKR and indicates its dependence on the subsampling size $m$ when $n \\ge m^{2+p}$. A similar observation can be found in Theorem 2 of [16] for Nystr\u00f6m KRR, where the fast learning rate also relies on the growth of $m$ under fixed hypothesis space complexity. However, even if we do not consider the complexity of the hypothesis space, increasing $m$ adds to the computational complexity. Hence, a suitable size of $m$ is a trade-off between the approximation performance and the computational complexity. When $p \\in (0, 2)$, $m = n^{\\frac{1}{2+p}}$ means that $m$ can be chosen between $n^{\\frac{1}{4}}$ and $n^{\\frac{1}{2}}$ under the conditions in Theorem 2. In particular, the fast convergence rate $O(m^{-1})$ can be obtained as $K \\in C^\\infty$, $\\theta \\to 1$, and $\\beta \\to 1$.\n\nThe works most closely related to Theorems 1 and 2 are presented in [16, 24], where learning bounds are established for Nystr\u00f6m KRR. Compared with the previous results, the features of this paper can be summarized as follows.\n\n\u2022 Learning model. This paper considers Nystr\u00f6m regression with data dependent hypothesis space and coefficient regularization, which can employ general kernels including indefinite and nonsymmetric kernels. However, the previous analysis focuses only on positive semi-definite kernels and the regularizer in RKHS. For fixed design KRR, the fast convergence $O(m^{-1})$ in [24] depends on the eigenvalue condition of the kernel matrix. 
Differently from [24], our result relies on the Lipschitz continuity of the kernel and the approximation condition $D(\\lambda)$ for the statistical learning setting.\n\n\u2022 Analysis technique. The previous analysis in [16, 24] utilizes theoretical techniques for operator approximation and matrix decomposition, which depend heavily on the symmetric positive semi-definite kernel. For GNKR (4), the previous analysis is not directly valid since the kernel need not satisfy the positive semi-definite or symmetric condition. The flexibility of the kernel and the adaptivity of the hypothesis space introduce additional difficulty into the error analysis. Fortunately, the error analysis is obtained by incorporating the error decomposition ideas in [7] and the concentration estimate techniques in [18, 19]. An interesting future work is to establish the optimal bound of GNKR to extend Theorem 2 in [16] to the general setting.\n\nFor the proofs of Theorems 1 and 2, the key idea is to use $\\hat g_\\lambda$ as the stepping stone function to bridge $f_z$ and $g_\\lambda$. Additionally, the connection between $g_\\lambda = L_K f_\\lambda$ and $f_\\rho$ has been well studied in learning theory. Hence, the proofs in the Appendix follow from the approximation decomposition.\n\nIn the remainder of this section, we present a simple analysis for column-norm subsampling.\n\nGiven the full samples $z = \\{(x_i, y_i)\\}_{i=1}^n$ and sampling number $m$, the key of subsampling is to select a subset of $z$ with strong inference ability. In other words, we should select the subset with small divergence from the full sample estimator. Following this idea, the optimal subsampling criterion is studied in [22, 28] for linear regression. Given $z = \\{z_i\\}_{i=1}^n$ and $K_{nn}$, we introduce the objective function\n\n$$S(p) := S(p_1, ..., p_n) = \\sum_{i=1}^n \\frac{1 - L_{ii}}{p_i} \\|K_i\\|_2^2$$\n\nby extending (16) in [28] to the kernel-based setting. Here $\\{p_i\\}_{i=1}^n$ are the sampling probabilities with respect to $\\{x_i\\}_{i=1}^n$ and $L_{ii} = (K_{nn}(K_{nn}^T K_{nn} + \\lambda n^2 I_n)^{-1} K_{nn}^T)_{ii}$, $i \\in \\{1, ..., n\\}$, are basic leverage values obtained from (2). For the fixed design setting, assume that $y_i = K_i^T \\alpha_0 + \\varepsilon_i$, $i = 1, ..., n$, $\\alpha_0 \\in \\mathbb{R}^n$, where $\\{\\varepsilon_i\\}_{i=1}^n$ are drawn identically and independently from $N(0, \\sigma^2)$. Then, for $\\lambda = 0$, $\\min_p S(p_1, ..., p_n)$ can be transformed into $\\min_p \\mathrm{E}\\,\\mathrm{tr}((K_{nn})^T (\\mathrm{diag}(p))^{-1} K_{nn})$, which is related to the A-optimality or A-criterion for subset selection in [22].\n\nWhen $L_{ii} \\to 0$ for any $i \\in \\{1, ..., n\\}$, we can get the following sampling probabilities.\n\nTheorem 3 When $L_{ii} = o(1)$ for $1 \\le i \\le n$, the minimizer of $S(p_1, ..., p_n)$ can be approximated by\n\n$$p_i = \\frac{\\|K_i\\|_2}{\\sum_{j=1}^n \\|K_j\\|_2}, \\quad i \\in \\{1, ..., n\\}.$$\n\nUsually, the leverage values are computed by fast approximation algorithms [1, 16] since $L_{ii}$ involves the inverse matrix. Unlike the leverage values, the sampling probabilities in Theorem 3 can be computed directly, since they involve only the $\\ell_2$ column norms of the empirical kernel matrix.\n\nTable 1: Average RMSE of GNKR with Gaussian(G)/Epanechnikov(E) kernel under different sampling strategies and sampling size. 
US:=Uniform subsampling, CS:=Column-norm subsampling.\n\nFunction / Algorithm  #300  #400  #500  #600  #700  #800  #900  #1000\n$f_1(x) = x \\sin x$, $x \\in [0, 2\\pi]$\nG-GNKR-US  0.03412  0.03145  0.02986  0.02919  0.02897  0.02906  0.02896  0.02908\nG-GNKR-CS  0.03420  0.03086  0.02954  0.02911  0.02890  0.02878  0.02891  0.02889\nE-GNKR-US  0.10159  0.09653  0.09081  0.08718  0.08515  0.08278  0.08198  0.08024\nE-GNKR-CS  0.09941  0.09414  0.08908  0.08631  0.08450  0.08237  0.08118  0.07898\n$f_2(x) = \\frac{\\sin x}{x}$, $x \\in [-2\\pi, 2\\pi]$\nG-GNKR-US  0.03442  0.03434  0.03418  0.03409  0.03404  0.03400  0.03398  0.03395\nG-GNKR-CS  0.03444  0.03423  0.03419  0.03408  0.03397  0.03397  0.03396  0.03389\nE-GNKR-US  0.04786  0.04191  0.04073  0.03692  0.03582  0.03493  0.03470  0.03440\nE-GNKR-CS  0.04607  0.03865  0.03709  0.03573  0.03510  0.03441  0.03316  0.03383\n$f_3(x) = \\mathrm{sign}(x)$, $x \\in [-3, 3]$\nG-GNKR-US  0.29236  0.29102  0.29009  0.28908  0.28867  0.28839  0.28755  0.28742\nG-GNKR-CS  0.29319  0.29071  0.28983  0.28975  0.28903  0.28833  0.28797  0.28768\nE-GNKR-US  0.16170  0.15822  0.15537  0.15188  0.15086  0.14889  0.14730  0.14726\nE-GNKR-CS  0.16500  0.15579  0.15205  0.15201  0.14949  0.14698  0.14597  0.14566\n$f_4(x) = \\cos(e^x) + \\frac{\\sin x}{x}$, $x \\in [-2, 4]$\nG-GNKR-US  0.34916  0.35158  0.35155  0.35148  0.35156  0.35140  0.35136  0.35139\nG-GNKR-CS  0.34909  0.35171  0.35168  0.35133  0.35153  0.35145  0.35141  0.35138\nE-GNKR-US  0.22298  0.21012  0.20265  0.19977  0.19414  0.19126  0.18916  0.18560\nE-GNKR-CS  0.21624  0.20783  0.20024  0.19698  0.19260  0.18996  0.18702  0.18662\n\n5 Experimental Analysis\n\nSince kernel regression with different types of regularization has been well studied in [7, 20, 21], this section just presents the empirical evaluation of GNKR to illustrate the roles of sampling strategy and kernel function. 
The Gaussian kernel $K_G(x, t) = \\exp\\big\\{ -\\frac{\\|x - t\\|_2^2}{2\\sigma^2} \\big\\}$ is used for simulated data and real data. The Epanechnikov kernel $K_E(x, t) = \\big( 1 - \\frac{\\|x - t\\|_2^2}{2\\sigma^2} \\big)_+$ is used in the simulated experiment. Here, $\\sigma$ denotes the scale parameter selected from $[10^{-5} : 10 : 10^4]$. Following the discussion on parameter selection in [16], we select the regularization parameter of GNKR from $[10^{-15} : 10 : 10^{-3}]$. The best results are reported according to the Root Mean Squared Error (RMSE).\n\n5.1 Experiments on synthetic data\n\nFollowing the empirical studies in [20, 21], we design simulation experiments on $f_1(x) = x \\sin x$, $x \\in [0, 2\\pi]$; $f_2(x) = \\frac{\\sin x}{x}$, $x \\in [-2\\pi, 2\\pi]$; $f_3(x) = \\mathrm{sign}(x)$, $x \\in [-3, 3]$; and $f_4(x) = \\cos(e^x) + \\frac{\\sin x}{x}$, $x \\in [-2, 4]$. The function $f_i$ is considered as the true regression function for $1 \\le i \\le 4$. Note that $f_1, f_2$ are smooth, $f_3$ is not continuous, and $f_4$ contains a highly oscillatory part. First, we select 10000 points randomly from the preset interval and generate the dependent variable $y$ according to the corresponding function. Then we divide these data into two parts of equal size; one part is used as the training samples and the other as the testing samples. For the training samples, the output $y$ is contaminated by Gaussian noise $N(0, 1)$. For each function and each kernel, we run the experiment 20 times. The average RMSE is shown in Table 1. The results indicate that column norm subsampling achieves satisfactory performance. In particular, GNKR with the indefinite Epanechnikov kernel performs better than with the Gaussian kernel for the discontinuous function $f_3$ and the non-flat function $f_4$. 
This observation is consistent with the empirical results in [21].\n\n5.2 Experiments on real data\n\nTo better evaluate the empirical performance, four data sets are used in our study: the Wine Quality, CASP, and Year Prediction datasets (http://archive.ics.uci.edu/ml/) and the census-house dataset (http://www.cs.toronto.edu/~delve/data/census-house/desc.html). Detailed information about the data sets is shown in Table 2. First, each data set is standardized by subtracting its mean and dividing by its standard deviation. Then, each input vector is unitized. For CASP and Year Prediction, 20000 samples are drawn randomly from the data sets, where half are used for training and the rest for testing. For the other datasets, we randomly select part of the samples for training and use the rest as the test set. Table 3 reports the average RMSE over ten trials.\n\nTable 3 shows the performance of the two sampling strategies. For CASP and Year Prediction, we can see that GNKR with 100 selected samples achieves satisfactory performance, which efficiently reduces the computational complexity of (2). Additionally, the competitive performance of GNKR with the Epanechnikov kernel is demonstrated by the experimental results on the four data sets. These empirical examples support the effectiveness of the proposed method.\n\nTable 2: Statistics of data sets\n\nDataset  #Features  #Instances  #Train  #Test\nWine Quality  12  4898  2000  2898\nYear Prediction  90  515345  10000  10000\nCASP  9  45730  10000  10000\ncensus-house  139  22784  12000  10784\n\nTable 3: Average RMSE ($\\times 10^{-3}$) with Gaussian(G)/Epanechnikov(E) kernel under different sampling levels and strategies. 
US:=Uniform subsampling, CS:=Column-norm subsampling.\n\nDataset / Algorithm  #50  #100  #200  #400  #600  #800  #1000\nWine Quality\nG-GNKR-US  14.567  14.438  14.382  14.292  14.189  14.103  13.936\nG-GNKR-CS  14.563  14.432  14.394  14.225  14.138  14.014  13.936\nE-GNKR-US  13.990  13.928  13.807  13.636  13.473  13.381  13.217\nE-GNKR-CS  13.969  13.899  13.798  13.601  13.445  13.362  13.239\nCASP\nG-GNKR-US  9.275  9.238  9.205  9.222  9.204  9.207  9.205\nG-GNKR-CS  9.220  9.196  9.205  9.193  9.198  9.199  9.198\nE-GNKR-US  4.282  4.196  4.213  4.153  4.181  4.174  4.180\nE-GNKR-CS  4.206  4.249  4.206  4.182  4.172  4.165  4.118\nYear Prediction\nG-GNKR-US  8.806  8.802  8.798  8.795  8.792  8.790  8.782\nG-GNKR-CS  8.806  8.801  8.798  8.793  8.792  8.789  8.781\nE-GNKR-US  7.013  6.842  6.739  6.700  6.676  6.671  6.637\nE-GNKR-CS  7.006  6.861  6.804  6.705  6.697  6.663  6.662\ncensus-house\nG-GNKR-US  111.084  111.083  111.082  111.079  111.077  111.074  111.071\nG-GNKR-CS  111.083  111.080  111.080  111.079  111.075  111.071  111.068\nE-GNKR-US  102.731  99.535  99.698  99.718  99.715  99.714  99.713\nE-GNKR-CS  102.703  99.528  99.697  99.716  99.714  99.714  99.712\n\n6 Conclusion\n\nThis paper focuses on the learning theory analysis of Nystr\u00f6m kernel regression. One key difference from previous related work is that GNKR uses a general continuous kernel function and $\\ell_2$ coefficient regularization. Stepping-stone functions are constructed to overcome the analysis difficulty induced by this difference. A learning bound with fast convergence is derived under mild conditions, and empirical analysis is provided to verify our theoretical analysis.\n\nAcknowledgments\n\nThis work was partially supported by U.S. 
NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, NSF-IIS 1619308, NSF-IIS 1633753, NIH AG049371, and by National Natural Science Foundation of China (NSFC) 11671161. We thank the anonymous NIPS reviewers for insightful comments.\n\nReferences\n\n[1] A. Alaoui and M.W. Mahoney. Fast randomized kernel methods with statistical guarantees. In NIPS, pp. 775\u2013783, 2015.\n\n[2] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In COLT, 2013.\n\n[3] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices I: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, pp. 158\u2013183, 2006.\n\n[4] P. Drineas and M.W. Mahoney. On the Nystr\u00f6m method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6: 2153\u20132175, 2005.\n\n[5] P. Drineas, M. Magdon-Ismail, M.W. Mahoney, and D.P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13: 3475\u20133506, 2012.\n\n[6] M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In NIPS, pp. 1539\u20131547, 2011.\n\n[7] Y. Feng, S. Lv, H. Huang, and J. Suykens. Kernelized elastic net regularization: generalization bounds and sparse recovery. Neural Computat., 28: 1\u201338, 2016.\n\n[8] A. Gittens and M.W. Mahoney. Revisiting the Nystr\u00f6m method for improved large-scale machine learning. In ICML, pp. 567\u2013575, 2013.\n\n[9] C.J. Hsieh, S. Si, and I.S. Dhillon. Fast prediction for large scale kernel machines. In NIPS, pp. 3689\u20133697, 2014.\n\n[10] R. Jin, T. Yang, M. Mahdavi, Y. Li, and Z. Zhou. Improved bounds for the Nystr\u00f6m method with application to kernel classification. IEEE Trans. Inf. Theory, 59(10): 6939\u20136949, 2013.\n\n[11] S. Kumar, M. Mohri, and A. Talwalkar. Sampling methods for the Nystr\u00f6m method. J. Mach. Learn. Res., 13: 981\u20131006, 2012.\n\n[12] W. Lim, M. 
Kim, H. Park, and K. Jung. Double Nystr\u00f6m method: An efficient and accurate Nystr\u00f6m scheme for large-scale data sets. In ICML, pp. 1367\u20131375, 2015.\n\n[13] C. Liu. Gabor-based kernel PCA with fractional power polynomial models for face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 26: 572\u2013581, 2004.\n\n[14] E. Pekalska and B. Haasdonk. Kernel discriminant analysis with positive definite and indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell., 31: 1017\u20131032, 2009.\n\n[15] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pp. 1177\u20131184, 2007.\n\n[16] A. Rudi, R. Camoriano, and R. Rosasco. Less is more: Nystr\u00f6m computation regularization. In NIPS, pp. 1657\u20131665, 2015.\n\n[17] B. Sch\u00f6lkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.\n\n[18] L. Shi, Y. Feng, and D.X. Zhou. Concentration estimates for learning with $\\ell_1$-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal., 31(2): 286\u2013302, 2011.\n\n[19] L. Shi. Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal., 34(2): 252\u2013265, 2013.\n\n[20] H. Sun and Q. Wu. Least square regression with indefinite kernels and coefficient regularization. Appl. Comput. Harmon. Anal., 30(1): 96\u2013109, 2011.\n\n[21] H. Sun and Q. Wu. Sparse representation in kernel machines. IEEE Trans. Neural Netw. Learning Syst., 26(10): 2576\u20132582, 2015.\n\n[22] Y. Wang and A. Singh. Minimax subsampling for estimation and prediction in low-dimensional linear regression. arXiv, 2016 (https://arxiv.org/pdf/1601.02068v2.pdf).\n\n[23] C. Williams and M. Seeger. Using the Nystr\u00f6m method to speed up kernel machines. In NIPS, pp. 682\u2013688, 2001.\n\n[24] T. Yang, Y.F. Li, M. Mahdavi, R. Jin, and Z.H. Zhou. 
Nystr\u00f6m method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pp. 485\u2013493, 2012.\n\n[25] Y. Yang, M. Pilanci, and M.J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. arXiv:1501.06195, 2015 (http://arxiv.org/abs/1501.06195).\n\n[26] Y. Ying, C. Campbell, and M. Girolami. Analysis of SVM with indefinite kernels. In NIPS, pp. 2205\u20132213, 2009.\n\n[27] K. Zhang, I.W. Tsang, and J.T. Kwok. Improved Nystr\u00f6m low-rank approximation and error analysis. In ICML, pp. 1232\u20131239, 2008.\n\n[28] R. Zhu, P. Ma, M.W. Mahoney, and B. Yu. Optimal subsampling approaches for large sample linear regression. arXiv:1509.05111, 2015 (http://arxiv.org/abs/1509.05111).", "award": [], "sourceid": 1324, "authors": [{"given_name": "Hong", "family_name": "Chen", "institution": "University of Texas"}, {"given_name": "Haifeng", "family_name": "Xia", "institution": "Huazhong Agricultural University"}, {"given_name": "Heng", "family_name": "Huang", "institution": "University of Texas Arlington"}, {"given_name": "Weidong", "family_name": "Cai", "institution": "University of Sydney"}]}