{"title": "Less is More: Nystr\u00f6m Computational Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1657, "page_last": 1665, "abstract": "We study Nystr\u00f6m type subsampling approaches to large scale kernel methods, and prove learning bounds in the statistical learning setting, where random sampling and high probability estimates are considered. In particular, we prove that these approaches can achieve optimal learning bounds, provided the subsampling level is suitably chosen. These results suggest a simple incremental variant of Nystr\u00f6m kernel ridge regression, where the subsampling level controls at the same time regularization and computations. Extensive experimental analysis shows that the considered approach achieves state of the art performances on benchmark large scale datasets.", "full_text": "Less is More: Nystr\u00a8om Computational Regularization\n\nAlessandro Rudi\u2020\n\nRaffaello Camoriano\u2020\u2021\n\nLorenzo Rosasco\u2020\u25e6\n\n\u2020Universit`a degli Studi di Genova - DIBRIS, Via Dodecaneso 35, Genova, Italy\n\u2021Istituto Italiano di Tecnologia - iCub Facility, Via Morego 30, Genova, Italy\n\u25e6Massachusetts Institute of Technology and Istituto Italiano di Tecnologia\n\nLaboratory for Computational and Statistical Learning, Cambridge, MA 02139, USA\n{ale rudi, lrosasco}@mit.edu\nraffaello.camoriano@iit.it\n\nAbstract\n\nWe study Nystr\u00a8om type subsampling approaches to large scale kernel methods,\nand prove learning bounds in the statistical learning setting, where random sam-\npling and high probability estimates are considered. In particular, we prove that\nthese approaches can achieve optimal learning bounds, provided the subsampling\nlevel is suitably chosen. 
These results suggest a simple incremental variant of Nyström Kernel Regularized Least Squares, where the subsampling level implements a form of computational regularization, in the sense that it controls at the same time regularization and computations. Extensive experimental analysis shows that the considered approach achieves state of the art performances on benchmark large scale datasets.

1 Introduction

Kernel methods provide an elegant and effective framework to develop nonparametric statistical approaches to learning [1]. However, memory requirements make these methods unfeasible when dealing with large datasets. Indeed, this observation has motivated a variety of computational strategies to develop large scale kernel methods [2–8].
In this paper we study subsampling methods, that we broadly refer to as Nyström approaches. These methods replace the empirical kernel matrix, needed by standard kernel methods, with a smaller matrix obtained by (column) subsampling [2, 3]. Such procedures are shown to often dramatically reduce memory/time requirements while preserving good practical performances [9–12]. The goal of our study is two-fold. First, and foremost, we aim at providing a theoretical characterization of the generalization properties of such learning schemes in a statistical learning setting. Second, we wish to understand the role played by the subsampling level both from a statistical and a computational point of view. As discussed in the following, this latter question leads to a natural variant of Kernel Regularized Least Squares (KRLS), where the subsampling level controls both regularization and computations.
From a theoretical perspective, the effect of Nyström approaches has been primarily characterized considering the discrepancy between a given empirical kernel matrix and its subsampled version [13–19]. 
While interesting in their own right, these latter results do not directly yield information on the generalization properties of the obtained algorithm. Results in this direction, albeit suboptimal, were first derived in [20] (see also [21, 22]), and more recently in [23, 24]. In these latter papers, sharp error analyses in expectation are derived in a fixed design regression setting for a form of Kernel Regularized Least Squares. In particular, in [23] a basic uniform sampling approach is studied, while in [24] a subsampling scheme based on the notion of leverage score is considered. The main technical contribution of our study is an extension of these latter results to the statistical learning setting, where the design is random and high probability estimates are considered. The more general setting makes the analysis considerably more complex. Our main result gives optimal finite sample bounds for both uniform and leverage score based subsampling strategies. These methods are shown to achieve the same (optimal) learning error as kernel regularized least squares, recovered as a special case, while allowing substantial computational gains. Our analysis highlights the interplay between the regularization and subsampling parameters, suggesting that the latter can be used to control simultaneously regularization and computations. This strategy implements a form of computational regularization, in the sense that the computational resources are tailored to the generalization properties of the data. This idea is developed considering an incremental strategy to efficiently compute learning solutions for different subsampling levels. 
The procedure thus obtained, which is a simple variant of classical Nyström Kernel Regularized Least Squares with uniform sampling, allows for efficient model selection and achieves state of the art results on a variety of benchmark large scale datasets.
The rest of the paper is organized as follows. In Section 2, we introduce the setting and algorithms we consider. In Section 3, we present our main theoretical contributions. In Section 4, we discuss computational aspects and experimental results.

2 Supervised learning with KRLS and Nyström approaches

Let X × R be a probability space with distribution ρ, where we view X and R as the input and output spaces, respectively. Let ρ_X denote the marginal distribution of ρ on X and ρ(·|x) the conditional distribution on R given x ∈ X. Given a hypothesis space H of measurable functions from X to R, the goal is to minimize the expected risk,

min_{f ∈ H} E(f),    E(f) = ∫_{X×R} (f(x) − y)^2 dρ(x, y),    (1)

provided ρ is known only through a training set of (x_i, y_i)_{i=1}^n sampled identically and independently according to ρ. A basic example of the above setting is random design regression with the squared loss, in which case

y_i = f*(x_i) + ε_i,    i = 1, . . . , n,    (2)

with f* a fixed regression function, ε_1, . . . , ε_n a sequence of random variables seen as noise, and x_1, . . . , x_n random inputs. In the following, we consider kernel methods, based on choosing a hypothesis space which is a separable reproducing kernel Hilbert space. 
The latter is a Hilbert space H of functions, with inner product ⟨·, ·⟩_H, such that there exists a function K : X × X → R with the following two properties: 1) for all x ∈ X, K_x(·) = K(x, ·) belongs to H, and 2) the so called reproducing property holds: f(x) = ⟨f, K_x⟩_H, for all f ∈ H, x ∈ X [25]. The function K, called reproducing kernel, is easily shown to be symmetric and positive definite, that is the kernel matrix (K_N)_{i,j} = K(x_i, x_j) is positive semidefinite for all x_1, . . . , x_N ∈ X, N ∈ N. A classical way to derive an empirical solution to problem (1) is to consider a Tikhonov regularization approach, based on the minimization of the penalized empirical functional,

min_{f ∈ H} (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2 + λ‖f‖_H^2,    λ > 0.    (3)

The above approach is referred to as Kernel Regularized Least Squares (KRLS) or Kernel Ridge Regression (KRR). It is easy to see that a solution f̂_λ to problem (3) exists, it is unique and the representer theorem [1] shows that it can be written as

f̂_λ(x) = Σ_{i=1}^n α̂_i K(x_i, x)    with    α̂ = (K_n + λnI)^{−1} y,    (4)

where x_1, . . . , x_n are the training set points, y = (y_1, . . . , y_n) and K_n is the empirical kernel matrix. Note that this result implies that we can restrict the minimization in (3) to the space

H_n = {f ∈ H | f = Σ_{i=1}^n α_i K(x_i, ·), α_1, . . . , α_n ∈ R}.

Storing the kernel matrix K_n, and solving the linear system in (4), can become computationally unfeasible as n increases. In the following, we consider strategies to find more efficient solutions, based on the idea of replacing H_n with

H_m = {f | f = Σ_{i=1}^m α_i K(x̃_i, ·), α ∈ R^m},

where m ≤ n and {x̃_1, . . .
, x̃_m} is a subset of the input points in the training set. The solution f̂_{λ,m} of the corresponding minimization problem can now be written as

f̂_{λ,m}(x) = Σ_{i=1}^m α̃_i K(x̃_i, x)    with    α̃ = (K_{nm}^⊤ K_{nm} + λn K_{mm})^† K_{nm}^⊤ y,    (5)

where A^† denotes the Moore-Penrose pseudoinverse of a matrix A, and (K_{nm})_{ij} = K(x_i, x̃_j), (K_{mm})_{kj} = K(x̃_k, x̃_j) with i ∈ {1, . . . , n} and j, k ∈ {1, . . . , m} [2]. The above approach is related to Nyström methods, and different approximation strategies correspond to different ways to select the inputs subset. While our framework applies to a broader class of strategies, see Section C.1, in the following we primarily consider two techniques.
Plain Nyström. The points {x̃_1, . . . , x̃_m} are sampled uniformly at random without replacement from the training set.
Approximate leverage scores (ALS) Nyström. Recall that the leverage scores associated to the training set points x_1, . . . , x_n are

l_i(t) = (K_n(K_n + tnI)^{−1})_{ii},    i ∈ {1, . . . , n},    (6)

for any t > 0, where (K_n)_{ij} = K(x_i, x_j). In practice, leverage scores are onerous to compute and approximations (l̂_i(t))_{i=1}^n can be considered [16, 17, 24]. In particular, in the following we are interested in suitable approximations defined as follows:
Definition 1 (T-approximate leverage scores). Let (l_i(t))_{i=1}^n be the leverage scores associated to the training set for a given t. Let δ > 0, t_0 > 0 and T ≥ 1. We say that (l̂_i(t))_{i=1}^n are T-approximate leverage scores with confidence δ, when with probability at least 1 − δ,

(1/T) l_i(t) ≤ l̂_i(t) ≤ T l_i(t)    ∀i ∈ {1, . . .
, n}, t ≥ t_0.

Given T-approximate leverage scores for t > λ_0, {x̃_1, . . . , x̃_m} are sampled from the training set independently with replacement, and with probability to be selected given by P_t(i) = l̂_i(t)/Σ_j l̂_j(t). In the next section, we state and discuss our main result showing that the KRLS formulation based on plain or approximate leverage scores Nyström provides optimal empirical solutions to problem (1).

3 Theoretical analysis

In this section, we state and discuss our main results. We need several assumptions. The first basic assumption is that problem (1) admits at least a solution.
Assumption 1. There exists an f_H ∈ H such that

E(f_H) = min_{f ∈ H} E(f).

Note that, while the minimizer might not be unique, our results apply to the case in which f_H is the unique minimizer with minimal norm. Also, note that the above condition is weaker than assuming the regression function in (2) to belong to H. Finally, we note that the study of the paper can be adapted to the case in which minimizers do not exist, but the analysis is considerably more involved and left to a longer version of the paper.
The second assumption is a basic condition on the probability distribution.
Assumption 2. Let z_x be the random variable z_x = y − f_H(x), with x ∈ X, and y distributed according to ρ(y|x). Then, there exists M, σ > 0 such that E|z_x|^p ≤ (1/2) p! M^{p−2} σ^2 for any p ≥ 2, almost everywhere on X.

The above assumption is needed to control random quantities and is related to a noise assumption in the regression model (2). 
It is clearly weaker than the often considered bounded output assumption [25], and trivially verified in classification.
The last two assumptions describe the capacity (roughly speaking, the “size”) of the hypothesis space induced by K with respect to ρ and the regularity of f_H with respect to K and ρ. To discuss them, we first need the following definition.
Definition 2 (Covariance operator and effective dimensions). We define the covariance operator as

C : H → H,    ⟨f, Cg⟩_H = ∫_X f(x)g(x) dρ_X(x),    ∀ f, g ∈ H.

Moreover, for λ > 0, we define the random variable N_x(λ) = ⟨K_x, (C + λI)^{−1} K_x⟩_H, with x ∈ X distributed according to ρ_X, and let

N(λ) = E N_x(λ),    N_∞(λ) = sup_{x ∈ X} N_x(λ).

We add several comments. Note that C corresponds to the second moment operator, but we refer to it as the covariance operator with an abuse of terminology. Moreover, note that N(λ) = Tr(C(C + λI)^{−1}) (see [26]). This latter quantity, called effective dimension or degrees of freedom, can be seen as a measure of the capacity of the hypothesis space. The quantity N_∞(λ) can be seen to provide a uniform bound on the leverage scores in Eq. (6). Clearly, N(λ) ≤ N_∞(λ) for all λ > 0.
Assumption 3. The kernel K is measurable, C is bounded. Moreover, for all λ > 0 and a Q > 0,

N_∞(λ) < ∞,    (7)
N(λ) ≤ Qλ^{−γ},    0 < γ ≤ 1.    (8)

Measurability of K and boundedness of C are minimal conditions to ensure that the covariance operator is a well defined linear, continuous, self-adjoint, positive operator [25]. 
Condition (7) is satisfied if the kernel is bounded, sup_{x ∈ X} K(x, x) = κ^2 < ∞; indeed in this case N_∞(λ) ≤ κ^2/λ for all λ > 0. Conversely, it can be seen that condition (7) together with boundedness of C imply that the kernel is bounded, indeed¹ κ^2 ≤ 2‖C‖N_∞(‖C‖).
Boundedness of the kernel implies in particular that the operator C is trace class and allows to use tools from spectral theory. Condition (8) quantifies the capacity assumption and is related to covering/entropy number conditions (see [25] for further details). In particular, it is known that condition (8) is ensured if the eigenvalues (σ_i)_i of C satisfy a polynomial decaying condition σ_i ∼ i^{−1/γ}. Note that, since the operator C is trace class, Condition (8) always holds for γ = 1. Here, for space constraints and in the interest of clarity, we restrict to such a polynomial condition, but the analysis directly applies to other conditions including exponential decay or finite rank conditions [26]. Finally, we have the following regularity assumption.
Assumption 4. There exists s ≥ 0, 1 ≤ R < ∞, such that ‖C^{−s} f_H‖_H < R.

The above condition is fairly standard, and can be equivalently formulated in terms of classical concepts in approximation theory such as interpolation spaces [25]. Intuitively, it quantifies the degree to which f_H can be well approximated by functions in the RKHS H and allows to control the bias/approximation error of a learning solution. For s = 0, it is always satisfied. For larger s, we are assuming f_H to belong to subspaces of H that are the images of the fractional compact operators C^s. Such spaces contain functions which, expanded on a basis of eigenfunctions of C, have larger coefficients in correspondence to large eigenvalues. 
Such an assumption is natural in view of using techniques such as (4), which can be seen as a form of spectral filtering, that estimate stable solutions by discarding the contribution of small eigenvalues [27]. In the next section, we are going to quantify the quality of empirical solutions of Problem (1) obtained by schemes of the form (5), in terms of the quantities in Assumptions 2, 3, 4.

¹If N_∞(λ) is finite, then N_∞(‖C‖) = sup_{x ∈ X} ‖(C + ‖C‖I)^{−1/2} K_x‖^2 ≥ (1/2)‖C‖^{−1} sup_{x ∈ X} ‖K_x‖^2, therefore K(x, x) ≤ 2‖C‖N_∞(‖C‖).

3.1 Main results

In this section, we state and discuss our main results, starting with optimal finite sample error bounds for regularized least squares based on plain and approximate leverage score based Nyström subsampling.
Theorem 1. Under Assumptions 1, 2, 3, and 4, let δ > 0, v = min(s, 1/2), p = 1 + 1/(2v + γ) and assume

n ≥ 1655κ^2 + 223κ^2 log(6κ^2/δ) + ((38p/‖C‖) log(114κ^2 p/(‖C‖δ)))^p.

Then, the following inequality holds with probability at least 1 − δ,

E(f̂_{λ,m}) − E(f_H) ≤ q^2 n^{−(2v+1)/(2v+γ+1)},    with    q = 6R (2‖C‖ + Mκ/√‖C‖ + √(Qσ^2/‖C‖^γ)) log(6/δ),    (9)

with f̂_{λ,m} as in (5), λ = ‖C‖ n^{−1/(2v+γ+1)} and

1. for plain Nyström

m ≥ (67 ∨ 5N_∞(λ)) log(12κ^2/(λδ));

2. 
for ALS Nyström and T-approximate leverage scores with subsampling probabilities P_λ, t_0 ≥ (19κ^2/n) log(12n/δ) and

m ≥ (334 ∨ 78T^2 N(λ)) log(48n/δ).

We add several comments. First, the above results can be shown to be optimal in a minimax sense. Indeed, minimax lower bounds proved in [26, 28] show that the learning rate in (9) is optimal under the considered assumptions (see Thm. 2, 3 of [26]; for a discussion on minimax lower bounds see Sec. 2 of [26]). Second, the obtained bounds can be compared to those obtained for other regularized learning techniques. Techniques known to achieve optimal error rates include Tikhonov regularization [26, 28, 29], iterative regularization by early stopping [30, 31], spectral cut-off regularization (a.k.a. principal component regression or truncated SVD) [30, 31], as well as regularized stochastic gradient methods [32]. All these techniques are essentially equivalent from a statistical point of view and differ only in the required computations. For example, iterative methods allow for a computation of solutions corresponding to different regularization levels which is more efficient than Tikhonov or SVD based approaches. The key observation is that all these methods have the same O(n^2) memory requirement. In this view, our results show that randomized subsampling methods can break such a memory barrier, and consequently achieve much better time complexity, while preserving optimal learning guarantees. Finally, we can compare our results with previous analysis of randomized kernel methods. As already mentioned, results close to those in Theorem 1 are given in [23, 24] in a fixed design setting. Our results extend and generalize the conclusions of these papers to a general statistical learning setting. 
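The quantities governing the subsampling levels in Theorem 1, namely the effective dimension N(λ) and the leverage scores of Eq. (6), are straightforward to estimate from an empirical kernel matrix. The following is a minimal NumPy sketch (the function names are ours, and the normalized kernel matrix K_n/n is used as a plug-in for the covariance operator C):

```python
import numpy as np

def effective_dimension(K, lam):
    # Empirical effective dimension N(lam) = Tr(C (C + lam I)^{-1}),
    # with the covariance operator C replaced by the normalized kernel matrix K / n.
    n = K.shape[0]
    sig = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)  # guard tiny negative round-off
    return float(np.sum(sig / (sig + lam)))

def leverage_scores(K, t):
    # Leverage scores l_i(t) = (K (K + t n I)^{-1})_{ii} of Eq. (6).
    n = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + t * n * np.eye(n)))
```

Note that the scores of Eq. (6) sum exactly to the empirical effective dimension, Σ_i l_i(t) = Tr(K_n(K_n + tnI)^{−1}), which is one way to see that N_∞(λ) provides a uniform bound on them.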
Relevant results are given in [8] for a different approach, based on averaging KRLS solutions obtained by splitting the data in m groups (divide and conquer RLS). The analysis in [8] is only in expectation, but considers random design and shows that the proposed method is indeed optimal provided the number of splits is chosen depending on the effective dimension N(λ). This is the only other work we are aware of establishing optimal learning rates for randomized kernel approaches in a statistical learning setting. In comparison with Nyström computational regularization, the main disadvantage of the divide and conquer approach is computational, in particular in the model selection phase, where solutions corresponding to different regularization parameters and numbers of splits usually need to be computed.
The proof of Theorem 1 is fairly technical and lengthy. It incorporates ideas from [26] and techniques developed to study spectral filtering regularization [30, 33]. In the next section, we briefly sketch some main ideas and discuss how they suggest an interesting perspective on regularization techniques including subsampling.

3.2 Proof sketch and a computational regularization perspective

A key step in the proof of Theorem 1 is an error decomposition, and corresponding bound, for any fixed λ and m. 
Indeed, it is proved in Theorem 2 and Proposition 2 that, for δ > 0, with probability at least 1 − δ,

|E(f̂_{λ,m}) − E(f_H)|^{1/2} ≲ R (M√(N_∞(λ))/n + √(σ^2 N(λ)/n)) log(6/δ) + R C(m)^{1/2+v} + R λ^{1/2+v}.    (10)

Figure 1: Validation errors associated to 20 × 20 grids of values for m (x axis) and λ (y axis) on pumadyn32nh (left), breast cancer (center) and cpuSmall (right).

The first and last term in the right hand side of the above inequality can be seen as forms of sample and approximation errors [25] and are studied in Lemma 4 and Theorem 2. The mid term can be seen as a computational error and depends on the considered subsampling scheme. Indeed, it is shown in Proposition 2 that C(m) can be taken as

C_pl(m) = min{ t > 0 | (67 ∨ 5N_∞(t)) log(12κ^2/(tδ)) ≤ m }

for the plain Nyström approach, and

C_ALS(m) = min{ (19κ^2/n) log(12n/δ) ≤ t ≤ ‖C‖ | 78T^2 N(t) log(48n/δ) ≤ m }

for the approximate leverage scores approach. The bounds in Theorem 1 follow by: 1) minimizing in λ the sum of the first and third term; 2) choosing m so that the computational error is of the same order of the other terms. Computational resources and regularization are then tailored to the generalization properties of the data at hand. We add a few comments. First, note that the error bound in (10) holds for a large class of subsampling schemes, as discussed in Section C.1 in the appendix. Then specific error bounds can be derived developing computational error estimates. 
Second, the error bounds in Theorem 2 and Proposition 2, and hence in Theorem 1, easily generalize to a larger class of regularization schemes beyond Tikhonov approaches, namely spectral filtering [30]. For space constraints, these extensions are deferred to a longer version of the paper. Third, we note that, in practice, optimal data driven parameter choices, e.g. based on hold-out estimates [31], can be used to adaptively achieve optimal learning bounds.
Finally, we observe that a different perspective is derived starting from inequality (10), noting that the roles played by m and λ can also be exchanged. Letting m play the role of a regularization parameter, λ can be set as a function of m and m tuned adaptively. For example, in the case of a plain Nyström approach, if we set

λ = (log m)/m,    and    m = 3n^{1/(2v+γ+1)} log n,

then the obtained learning solution achieves the error bound in Eq. (9). As above, the subsampling level can also be chosen by cross-validation. Interestingly, in this case by tuning m we naturally control computational resources and regularization. An advantage of this latter parameterization is that, as described in the following, the solution corresponding to different subsampling levels is easy to update using Cholesky rank-one update formulas [34]. 
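To make the update step concrete, the following is a minimal sketch of a rank-one Cholesky update/downdate routine in the role of the cholup call used by Algorithm 1; it is the textbook LINPACK-style procedure on upper triangular factors, not the authors' released implementation:

```python
import numpy as np

def cholup(R, x, sign):
    # Rank-one Cholesky update ('+') or downdate ('-'):
    # returns upper triangular R2 with R2.T @ R2 = R.T @ R +/- np.outer(x, x).
    R, x = R.copy(), x.astype(float)
    n = x.size
    for k in range(n):
        rkk = R[k, k]
        r = np.hypot(rkk, x[k]) if sign == '+' else np.sqrt(rkk ** 2 - x[k] ** 2)
        c, s = r / rkk, x[k] / rkk
        R[k, k] = r
        if k + 1 < n:
            if sign == '+':
                R[k, k + 1:] = (R[k, k + 1:] + s * x[k + 1:]) / c
            else:
                R[k, k + 1:] = (R[k, k + 1:] - s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * R[k, k + 1:]
    return R
```

Each call costs O(t^2) for a factor of size t, instead of the O(t^3) needed to recompute a factorization from scratch, which is what makes the incremental complexity of the next section possible.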
As discussed in the next section, in practice, a joint tuning over m and λ can be done starting from small m and appears to be advantageous both for error and computational performances.

4 Incremental updates and experimental analysis

In this section, we first describe an incremental strategy to efficiently explore different subsampling levels and then perform extensive empirical tests aimed in particular at: 1) investigating the statistical and computational benefits of considering varying subsampling levels, and 2) comparing the performance of the algorithm with respect to state of the art solutions on several large scale benchmark datasets.

Algorithm 1: Incremental Nyström KRLS.
Input: Dataset (x_i, y_i)_{i=1}^n, Subsampling (x̃_j)_{j=1}^m, Regularization Parameter λ.
Output: Nyström KRLS estimators {α̃_1, . . . , α̃_m}.
Compute γ_1; R_1 ← √γ_1;
for t ∈ {2, . . . , m} do
  Compute A_t, u_t, v_t;
  R_t ← (R_{t−1}, 0; 0, 0);
  R_t ← cholup(R_t, u_t, '+');
  R_t ← cholup(R_t, v_t, '−');
  α̃_t ← R_t^{−1}(R_t^{−⊤}(A_t^⊤ y));
end for

Figure 2: Model selection time on the cpuSmall dataset. m ∈ [1, 1000] and T = 50, 10 repetitions.

Throughout this section, we only consider a plain Nyström approach, deferring to future work the analysis of leverage scores based sampling techniques. 
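As a reference for the experiments, here is a minimal NumPy sketch of the plain Nyström estimator of Eq. (5) with a Gaussian kernel (non-incremental, for clarity; function names and the pseudoinverse-based solver are our illustrative choices, not the code released with the paper):

```python
import numpy as np

def gauss(A, B, sigma):
    # Gaussian kernel matrix K(a, b) = exp(-|a - b|^2 / (2 sigma^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def nystrom_krls(X, y, m, lam, sigma, rng):
    # Plain Nystrom: m landmarks drawn uniformly without replacement, then Eq. (5).
    n = X.shape[0]
    Xm = X[rng.choice(n, size=m, replace=False)]
    Knm = gauss(X, Xm, sigma)                      # n x m
    Kmm = gauss(Xm, Xm, sigma)                     # m x m
    alpha = np.linalg.pinv(Knm.T @ Knm + lam * n * Kmm) @ (Knm.T @ y)
    return Xm, alpha

def nystrom_predict(Xts, Xm, alpha, sigma):
    return gauss(Xts, Xm, sigma) @ alpha
```

For m = n the estimator coincides (when K_n is invertible) with exact KRLS of Eq. (4), while smaller m trades statistical accuracy for O(nm^2) time and O(nm) memory.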
Interestingly, we will see that such a basic approach can often provide state of the art performances.

4.1 Efficient incremental updates

Algorithm 1 efficiently computes solutions corresponding to different subsampling levels, by exploiting rank-one Cholesky updates [34]. The proposed procedure allows to efficiently compute a whole regularization path of solutions, and hence perform fast model selection² (see Sect. A). In Algorithm 1, the function cholup is the Cholesky rank-one update formula available in many linear algebra libraries. The total cost of the algorithm is O(nm² + m³) time to compute α̃_2, . . . , α̃_m, while a naive non-incremental algorithm would require O(nm²T + m³T), with T the number of analyzed subsampling levels. The following are some quantities needed by the algorithm: A_1 = a_1 and A_t = (A_{t−1}  a_t) ∈ R^{n×t}, for any 2 ≤ t ≤ m. Moreover, for any 1 ≤ t ≤ m, g_t = √(1 + γ_t) and

u_t = (c_t/(1 + g_t), g_t),    a_t = (K(x̃_t, x_1), . . . , K(x̃_t, x_n)),    c_t = A_{t−1}^⊤ a_t + λn b_t,
v_t = (c_t/(1 + g_t), −1),    b_t = (K(x̃_t, x̃_1), . . . , K(x̃_t, x̃_{t−1})),    γ_t = a_t^⊤ a_t + λn K(x̃_t, x̃_t).

4.2 Experimental analysis

We empirically study the properties of Algorithm 1, considering a Gaussian kernel of width σ. The selected datasets are already divided in a training and a test part³. We randomly split the training part in a training set and a validation set (80% and 20% of the n training points, respectively) for parameter tuning via cross-validation. The m subsampled points for Nyström approximation are selected uniformly at random from the training set. We report the performance of the selected model on the fixed test set, repeating the process for several trials.
Interplay between λ and m. 
We begin with a set of results showing that incrementally exploring different subsampling levels can yield very good performance while substantially reducing the computational requirements. We consider the pumadyn32nh (n = 8192, d = 32), the breast cancer (n = 569, d = 30), and the cpuSmall (n = 8192, d = 12) datasets⁴. In Figure 1, we report the validation errors associated to a 20 × 20 grid of values for λ and m. The λ values are logarithmically spaced, while the m values are linearly spaced. The ranges and kernel bandwidths, chosen according to preliminary tests on the data, are σ = 2.66, λ ∈ [10⁻⁷, 1], m ∈ [10, 1000] for pumadyn32nh; σ = 0.9, λ ∈ [10⁻¹², 10⁻³], m ∈ [5, 300] for breast cancer; and σ = 0.1, λ ∈ [10⁻¹⁵, 10⁻¹²], m ∈ [100, 5000] for cpuSmall. The main observation that can be derived from this first series of tests is that a small m is sufficient to obtain the same results achieved with the largest m. For example, for pumadyn32nh it is sufficient to choose m = 62 and λ = 10⁻⁷ to obtain an average test RMSE of 0.33 over 10 trials, which is the same as the one obtained using m = 1000 and λ = 10⁻³, with a 3-fold speedup of the joint training and validation phase. Also, it is interesting to observe that for given values of λ, large values of m can decrease the performance. This observation is consistent with the results in Section 3.1, showing that m can play the role of a regularization parameter. Similar results are obtained for breast cancer, where for λ = 4.28 × 10⁻⁶ and m = 300 we obtain a 1.24% average classification error on the test set over 20 trials, while for λ = 10⁻¹² and m = 67 we obtain 1.86%. For cpuSmall, with m = 5000 and λ = 10⁻¹² the average test RMSE over 5 trials is 12.2, while for m = 2679 and λ = 10⁻¹⁵ it is only slightly higher, 13.3, but computing its associated solution requires less than half of the time and approximately half of the memory.

²The code for Algorithm 1 is available at lcsl.github.io/NystromCoRe.
³In the following we denote by n the total number of points and by d the number of dimensions.
⁴www.cs.toronto.edu/~delve and archive.ics.uci.edu/ml/datasets

Table 1: Test RMSE comparison for exact and approximated kernel methods. The results for KRLS, Batch Nyström, RF and Fastfood are the ones reported in [6]. ntr is the size of the training set.

Dataset              ntr     d    Incremental           KRLS   Batch         RF      Fastfood  Fastfood  KRLS    Fastfood
                                  Nyström RBF           RBF    Nyström RBF   RBF     RBF       FFT       Matern  Matern
Insurance Company    5822    85   0.23180 ± 4 × 10⁻⁵    0.231  0.232         0.266   0.264     0.266     0.235   0.234
CPU                  6554    21   2.8466 ± 0.0497       7.271  6.758         7.103   7.366     4.544     4.345   4.211
CT slices (axial)    42800   384  7.1106 ± 0.0772       NA     60.683        49.491  43.858    58.425    NA      14.868
Year Prediction MSD  463715  90   0.10470 ± 5 × 10⁻⁵    NA     0.113         0.123   0.115     0.106     NA      0.116
Forest               522910  54   0.9638 ± 0.0186       NA     0.837         0.840   0.840     0.838     NA      0.976

Regularization path computation. If the subsampling level m is used as a regularization parameter, the computation of a regularization path corresponding to different subsampling levels becomes crucial during the model selection phase. A naive approach, that consists in recomputing the solutions of Eq. (5) for each subsampling level, would require O(m²nT + m³LT) computational time, where T is the number of solutions with different subsampling levels to be evaluated and L is the number of Tikhonov regularization parameters. 
On the other hand, by using the incremental Nyström algorithm the model selection time complexity is O(m²n + m³L) for the whole regularization path. We experimentally verify this speedup on cpuSmall with 10 repetitions, setting m ∈ [1, 5000] and T = 50. The model selection times, measured on a server with 12 × 2.10GHz Intel® Xeon® E5-2620 v2 CPUs and 132 GB of RAM, are reported in Figure 2. The result clearly confirms the beneficial effect of incremental Nyström model selection on the computational time.
Predictive performance comparison. Finally, we consider the performance of the algorithm on several large scale benchmark datasets considered in [6]; see Table 1. σ has been chosen on the basis of preliminary data analysis. m and λ have been chosen by cross-validation, starting from small subsampling values up to mmax = 2048, and considering λ ∈ [10⁻¹², 1]. After model selection, we retrain the best model on the entire training set and compute the RMSE on the test set. We consider 10 trials, reporting the performance mean and standard deviation.
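The (λ, m) grid protocol used throughout this section (logarithmically spaced λ, linearly spaced m, best pair chosen by hold-out error) can be sketched generically. In the following, `fit_predict` is a placeholder for any trainer, e.g. a Nyström KRLS routine; all names here are ours, not the paper's.

```python
import numpy as np

def rmse(a, b):
    # Root mean squared error between two prediction vectors.
    return float(np.sqrt(np.mean((a - b) ** 2)))

def grid_select(fit_predict, X_tr, y_tr, X_val, y_val,
                lam_range=(1e-7, 1.0), m_range=(10, 1000), size=20):
    # Hold-out model selection on a size x size grid: lambda values are
    # logarithmically spaced and m values linearly spaced, as in the
    # experiments reported above.
    lams = np.logspace(np.log10(lam_range[0]), np.log10(lam_range[1]), size)
    ms = np.linspace(m_range[0], m_range[1], size).astype(int)
    best_err, best_m, best_lam = np.inf, None, None
    for m in ms:
        for lam in lams:
            err = rmse(fit_predict(X_tr, y_tr, X_val, m, lam), y_val)
            if err < best_err:
                best_err, best_m, best_lam = err, int(m), float(lam)
    return best_err, best_m, best_lam
```

The selected (m, λ) pair is then used to retrain on the full training set, as done above before computing the test RMSE.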
The results in Table 1 compare Nyström computational regularization with the following methods (as in [6]):

• Kernel Regularized Least Squares (KRLS): not compatible with large datasets.
• Random Fourier features (RF): as in [4], with a number of random features D = 2048.
• Fastfood RBF, FFT and Matern kernel: as in [6], with D = 2048 random features.
• Batch Nyström: the Nyström method [3] with uniform sampling and m = 2048.

The above results show that the proposed incremental Nyström approach performs very well, matching state-of-the-art predictive performance.

Acknowledgments
The work described in this paper is supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216; and by FIRB project RBFR12M3AC, funded by the Italian Ministry of Education, University and Research.

References
[1] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, 2002.
[2] Alex J. Smola and Bernhard Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In ICML, pages 911–918. Morgan Kaufmann, 2000.
[3] C. Williams and M. Seeger. Using the Nyström Method to Speed Up Kernel Machines. In NIPS, pages 682–688. MIT Press, 2000.
[4] Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines. In NIPS, pages 1177–1184. Curran Associates, Inc., 2007.
[5] J. Yang, V. Sindhwani, H. Avron, and M. W. Mahoney. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. In ICML, volume 32 of JMLR Proceedings, pages 485–493. JMLR.org, 2014.
[6] Quoc V. Le, Tamás Sarlós, and Alexander J. Smola. Fastfood - Computing Hilbert Space Expansions in loglinear time. In ICML, volume 28 of JMLR Proceedings, pages 244–252.
JMLR.org, 2013.
[7] Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. Memory Efficient Kernel Approximation. In ICML, volume 32 of JMLR Proceedings, pages 701–709. JMLR.org, 2014.
[8] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Divide and Conquer Kernel Ridge Regression. In COLT, volume 30 of JMLR Proceedings, pages 592–617. JMLR.org, 2013.
[9] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström Method. In NIPS, pages 1060–1068, 2009.
[10] Mu Li, James T. Kwok, and Bao-Liang Lu. Making Large-Scale Nyström Approximation Possible. In ICML, pages 631–638. Omnipress, 2010.
[11] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström Low-rank Approximation and Error Analysis. In ICML, pages 1232–1239. ACM, 2008.
[12] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable Kernel Methods via Doubly Stochastic Gradients. In NIPS, pages 3041–3049, 2014.
[13] Petros Drineas and Michael W. Mahoney. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. JMLR, 6:2153–2175, December 2005.
[14] A. Gittens and M. W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In ICML, volume 28 of JMLR Proceedings, pages 567–575, 2013.
[15] Shusen Wang and Zhihua Zhang. Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling. JMLR, 14(1):2729–2769, 2013.
[16] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. JMLR, 13:3475–3506, 2012.
[17] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform Sampling for Matrix Approximation. In ITCS, pages 181–190. ACM, 2015.
[18] Shusen Wang and Zhihua Zhang. Efficient Algorithms and Error Analysis for the Modified Nyström Method.
In AISTATS, volume 33 of JMLR Proceedings, pages 996–1004. JMLR.org, 2014.
[19] S. Kumar, M. Mohri, and A. Talwalkar. Sampling methods for the Nyström method. JMLR, 13(1):981–1006, 2012.
[20] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the Impact of Kernel Approximation on Learning Accuracy. In AISTATS, volume 9 of JMLR Proceedings, pages 113–120. JMLR.org, 2010.
[21] R. Jin, T. Yang, M. Mahdavi, Y. Li, and Z. Zhou. Improved Bounds for the Nyström Method with Application to Kernel Classification. IEEE Transactions on Information Theory, 59(10), October 2013.
[22] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. In NIPS, pages 485–493, 2012.
[23] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In COLT, volume 30, 2013.
[24] A. Alaoui and M. W. Mahoney. Fast randomized kernel methods with statistical guarantees. arXiv, 2014.
[25] I. Steinwart and A. Christmann. Support Vector Machines. Springer New York, 2008.
[26] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[27] L. Lo Gerfo, Lorenzo Rosasco, Francesca Odone, Ernesto De Vito, and Alessandro Verri. Spectral Algorithms for Supervised Learning. Neural Computation, 20(7):1873–1897, 2008.
[28] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In COLT, 2009.
[29] S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1), 2010.
[30] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
[31] A. Caponnetto and Yuan Yao. Adaptive rates for regularization operators in learning theory.
Analysis and Applications, 8, 2010.
[32] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.
[33] Alessandro Rudi, Guillermo D. Canas, and Lorenzo Rosasco. On the Sample Complexity of Subspace Learning. In NIPS, pages 2067–2075, 2013.
[34] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.