{"title": "FALKON: An Optimal Large Scale Kernel Method", "book": "Advances in Neural Information Processing Systems", "page_first": 3888, "page_last": 3898, "abstract": "Kernel methods provide a principled way to perform non linear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that allows to efficiently process millions of points. FALKON is derived combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially $O(n)$ memory and $O(n\\sqrt{n})$ time. An extensive experimental analysis on large scale datasets shows that, even with a single machine, FALKON outperforms previous state of the art solutions, which exploit parallel/distributed architectures.", "full_text": "FALKON: An Optimal Large Scale Kernel Method\n\nAlessandro Rudi \u2217\n\nINRIA \u2013 Sierra Project-team,\n\u00b4Ecole Normale Sup\u00b4erieure, Paris\n\nLuigi Carratino\nUniversity of Genoa\n\nGenova, Italy\n\nLorenzo Rosasco\nUniversity of Genoa,\nLCSL, IIT & MIT\n\nAbstract\n\nKernel methods provide a principled way to perform non linear, nonparametric\nlearning. They rely on solid functional analytic foundations and enjoy optimal\nstatistical properties. However, at least in their basic form, they have limited\napplicability in large scale scenarios because of stringent computational require-\nments in terms of time and especially memory. 
In this paper, we take a substantial\nstep in scaling up kernel methods, proposing FALKON, a novel algorithm that\ncan efficiently process millions of points. FALKON is derived by combining\nseveral algorithmic principles, namely stochastic subsampling, iterative solvers and\npreconditioning. Our theoretical analysis shows that optimal statistical accuracy\nis achieved requiring essentially $O(n)$ memory and $O(n\\sqrt{n})$ time. An extensive\nexperimental analysis on large scale datasets shows that, even with a single machine,\nFALKON outperforms previous state of the art solutions, which exploit\nparallel/distributed architectures.\n\n1 Introduction\n\nThe goal in supervised learning is to learn from examples a function that predicts new data well.\nNonparametric methods are often crucial since the functions to be learned can be non-linear and\ncomplex. Kernel methods are probably the most popular among nonparametric learning methods, but\ndespite excellent theoretical properties, they have limited applications in large scale learning because\nof time and memory requirements, typically at least quadratic in the number of data points.\nOvercoming these limitations has motivated a variety of practical approaches including gradient\nmethods, as well as accelerated, stochastic and preconditioned extensions, to improve time complexity\n[1, 2, 3, 4, 5, 6]. Random projections provide an approach to reduce memory requirements, popular\nmethods including Nystr\u00f6m [7, 8], random features [9], and their numerous extensions. From a\ntheoretical perspective, a key question has become to characterize statistical and computational\ntrade-offs, that is, whether, or under which conditions, computational gains come at the expense of statistical\naccuracy. 
In particular, recent results considering least squares show that there is a large class of\nproblems for which, by combining Nystr\u00f6m or random features approaches [10, 11, 12, 13, 14, 15]\nwith ridge regression, it is possible to substantially reduce computations, while preserving the\nsame optimal statistical accuracy of exact kernel ridge regression (KRR). While statistical lower\nbounds exist for this setting, there are no corresponding computational lower bounds. The state of\nthe art approximation of KRR, for which optimal statistical bounds are known, typically requires\ncomplexities that are roughly $O(n^2)$ in time and memory (or possibly $O(n)$ in memory, if kernel\ncomputations are made on the fly).\n\n\u2217E-mail: alessandro.rudi@inria.fr. This work was done when A.R. was working at Laboratory of\nComputational and Statistical Learning (Istituto Italiano di Tecnologia).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this paper, we propose and study FALKON, a new algorithm that, to the best of our knowledge,\nhas the best known theoretical guarantees. At the same time FALKON provides an efficient approach\nto apply kernel methods on millions of points and, tested on a variety of large scale problems,\noutperforms previously proposed methods while utilizing only a fraction of computational resources.\nMore precisely, we take a substantial step in provably reducing the computational requirements,\nshowing that, up to logarithmic factors, a time complexity of $O(n\\sqrt{n})$ and a memory complexity of $O(n)$ are sufficient\nfor optimal statistical accuracy. Our new algorithm exploits the idea of using Nystr\u00f6m methods\nnot only to approximate the KRR problem, but also to efficiently compute a preconditioner to be used in\nconjugate gradient. To the best of our knowledge this is the first time all these ideas are combined\nand put to fruition. 
Our theoretical analysis derives optimal statistical rates both in a basic setting and\nunder benign conditions for which fast rates are possible. The potential benefits of different sampling\nstrategies are also analyzed. Most importantly, the empirical performance is thoroughly tested\non available large scale datasets. Our results show that, even on a single machine, FALKON can\noutperform state of the art methods on most problems both in terms of time efficiency and prediction\naccuracy. In particular, our results suggest that FALKON could be a viable kernel alternative to deep\nfully connected neural networks for large scale problems.\nThe rest of the paper is organized as follows. In Sect. 2 we give some background on kernel methods.\nIn Sect. 3 we introduce FALKON, while in Sect. 4 we present and discuss the main technical results.\nFinally in Sect. 5 we present experimental results.\n\n2 Statistical and Computational Trade-offs in Kernel Methods\n\nWe consider the supervised learning problem of estimating a function from random noisy samples. In\nstatistical learning theory, this can be formalized as the problem of solving\n\n$$\\inf_{f \\in \\mathcal{H}} \\mathcal{E}(f), \\qquad \\mathcal{E}(f) = \\int (f(x) - y)^2 \\, d\\rho(x, y), \\qquad (1)$$\n\ngiven samples $(x_i, y_i)_{i=1}^n$ from $\\rho$, which is fixed but unknown, and where $\\mathcal{H}$ is a space of candidate\nsolutions. Ideally, a good empirical solution $\\hat{f}$ should have small excess risk\n\n$$\\mathcal{R}(\\hat{f}) = \\mathcal{E}(\\hat{f}) - \\inf_{f \\in \\mathcal{H}} \\mathcal{E}(f), \\qquad (2)$$\n\nsince this implies it will generalize/predict well new data.\nIn this paper, we are interested in\nboth computational and statistical aspects of the above problem. In particular, we investigate the\ncomputational resources needed to achieve optimal statistical accuracy, i.e. minimal excess risk. Our\nfocus is on the most popular class of nonparametric methods, namely kernel methods.\nKernel methods and ridge regression. Kernel methods consider a space $\\mathcal{H}$ of functions\n\n$$f(x) = \\sum_{i=1}^n \\alpha_i K(x, x_i), \\qquad (3)$$\n\nwhere $K$ is a positive definite kernel (footnote 2). The coefficients $\\alpha_1, \\dots, \\alpha_n$ are typically derived from a\nconvex optimization problem, that for the square loss is\n\n$$\\hat{f}_{n,\\lambda} = \\operatorname{argmin}_{f \\in \\mathcal{H}} \\frac{1}{n} \\sum_{i=1}^n (f(x_i) - y_i)^2 + \\lambda \\|f\\|_{\\mathcal{H}}^2, \\qquad (4)$$\n\nand defines the so called kernel ridge regression (KRR) estimator [16]. An advantage of least squares\napproaches is that they reduce computations to a linear system\n\n$$(K_{nn} + \\lambda n I) \\, \\alpha = \\hat{y}, \\qquad (5)$$\n\nwhere $K_{nn}$ is an $n \\times n$ matrix defined by $(K_{nn})_{ij} = K(x_i, x_j)$ and $\\hat{y} = (y_1, \\dots, y_n)$. We next\ncomment on computational and statistical properties of KRR.\nComputations. Solving Eq. (5) for large datasets is challenging. A direct approach requires $O(n^2)$ in\nspace, to allocate $K_{nn}$, $O(n^2)$ kernel evaluations, and $O(n^2 c_K + n^3)$ in time, to compute and invert\n$K_{nn}$ ($c_K$ is the kernel evaluation cost, assumed constant and omitted throughout).\nStatistics. Under basic assumptions, KRR achieves an error $\\mathcal{R}(\\hat{f}_{\\lambda_n}) = O(n^{-1/2})$, for $\\lambda_n = n^{-1/2}$,\nwhich is optimal in a minimax sense and can be improved only under more stringent assumptions\n[17, 18].\n\nFootnote 2: $K$ is positive definite if the matrix with entries $K(x_i, x_j)$ is positive semidefinite for all $x_1, \\dots, x_N$, $N \\in \\mathbb{N}$ [16].\n\nThe question is then whether it is possible to achieve the statistical properties of KRR with fewer computations.\nGradient methods and early stopping. A natural idea is to consider iterative solvers and in\nparticular gradient methods, because of their simplicity and low iteration cost. A basic example is\ncomputing the coefficients in (3) by\n\n$$\\alpha_t = \\alpha_{t-1} - \\tau \\left[ (K_{nn}\\alpha_{t-1} - \\hat{y}) + \\lambda n \\alpha_{t-1} \\right], \\qquad (6)$$\n\nfor a suitable step-size choice $\\tau$.\nComputations. In this case, if $t$ is the number of iterations, gradient methods require $O(n^2 t)$ in\ntime, $O(n^2)$ in memory and $O(n^2)$ in kernel evaluations, if the kernel matrix is stored. Note that the\nkernel matrix can also be computed on the fly with only $O(n)$ memory, but $O(n^2 t)$ kernel evaluations\nare required. We note that, beyond the above simple iteration, several variants have been considered\nincluding accelerated [1, 19] and stochastic extensions [20].\nStatistics. The statistical properties of iterative approaches are well studied, also in the case where\n$\\lambda$ is set to zero and regularization is performed by choosing a suitable stopping time [21]. In this\nlatter case, the number of iterations can roughly be thought of as $1/\\lambda$, and $O(\\sqrt{n})$ iterations are needed\nfor basic gradient descent, $O(n^{1/4})$ for accelerated methods and possibly $O(1)$ iterations/epochs\nfor stochastic methods. Importantly, we note that unlike most optimization studies, here we are\nconsidering the number of iterations needed to solve (1), rather than (4).\nWhile the time complexity of these methods dramatically improves over KRR, and computations can\nbe done in blocks, memory requirements (or number of kernel evaluations) still make the application\nto large scale settings cumbersome. Randomization provides an approach to tackle this challenge.\nRandom projections. The rough idea is to use random projections to compute $K_{nn}$ only approximately.\nThe most popular examples in this class of approaches are Nystr\u00f6m [7, 8] and random\nfeatures [9] methods. In the following we focus in particular on a basic Nystr\u00f6m approach based on\nconsidering functions of the form\n\n$$\\tilde{f}_{\\lambda,M}(x) = \\sum_{i=1}^M \\tilde{\\alpha}_i K(x, \\tilde{x}_i), \\quad \\text{with } \\{\\tilde{x}_1, \\dots, \\tilde{x}_M\\} \\subseteq \\{x_1, \\dots, x_n\\}, \\qquad (7)$$\n\ndefined considering only a subset of $M$ training points sampled uniformly. In this case, there are only\n$M$ coefficients that, following the approach in (4), can be derived considering the linear system\n\n$$H\\tilde{\\alpha} = z, \\quad \\text{where } H = K_{nM}^\\top K_{nM} + \\lambda n K_{MM}, \\quad z = K_{nM}^\\top \\hat{y}. \\qquad (8)$$\n\nHere $K_{nM}$ is the $n \\times M$ matrix with $(K_{nM})_{ij} = K(x_i, \\tilde{x}_j)$ and $K_{MM}$ is the $M \\times M$ matrix with\n$(K_{MM})_{ij} = K(\\tilde{x}_i, \\tilde{x}_j)$. This method consists in subsampling the columns of $K_{nn}$ and can be seen\nas a particular form of random projections.\nComputations. Direct methods for solving (8) require $O(nM^2)$ in time to form $K_{nM}^\\top K_{nM}$ and\n$O(M^3)$ for solving the linear system, and only $O(nM)$ kernel evaluations. The naive memory\nrequirement is $O(nM)$ to store $K_{nM}$, however if $K_{nM}^\\top K_{nM}$ is computed in blocks of dimension at\nmost $M \\times M$ only $O(M^2)$ memory is needed. Iterative approaches as in (6) can also be combined\nwith random projections [22, 23, 24] to slightly reduce time requirements (see Table 1, or Sect. F in\nthe appendix, for more details).\nStatistics. The key point, though, is that random projections make it possible to dramatically reduce memory\nrequirements as soon as $M \\ll n$, and the question arises of whether this comes at the expense of\nstatistical accuracy. Interestingly, recent results considering this question show that there are large\nclasses of problems for which $M = \\tilde{O}(\\sqrt{n})$ suffices for the same optimal statistical accuracy of the\nexact KRR [11, 12, 13].\nIn summary, in this case the computations needed for optimal statistical accuracy are reduced from\n$O(n^2)$ to $O(n\\sqrt{n})$ kernel evaluations, but the best time complexity is basically $O(n^2)$. 
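\n\nAs a worked check of the summary above, here is a sketch that assumes $M = \\sqrt{n}$ exactly and drops constants and logarithmic factors. Under that assumption the costs quoted for the direct Nystr\u00f6m solver become\n\n$$\\underbrace{nM}_{\\text{kernel evaluations}} = n\\sqrt{n}, \\qquad \\underbrace{nM^2}_{\\text{forming } K_{nM}^\\top K_{nM}} = n^2, \\qquad \\underbrace{M^3}_{\\text{solving (8)}} = n\\sqrt{n},$$\n\nso the $O(n^2)$ time comes entirely from forming $K_{nM}^\\top K_{nM}$, while the kernel evaluations and the final solve are already $O(n\\sqrt{n})$.\n\n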
In the rest of\nthe paper we discuss how this requirement can indeed be dramatically reduced.\n\n3 FALKON\n\nOur approach is based on a novel combination of randomized projections with iterative solvers plus\npreconditioning. The main novelty is that we use random projections to approximate both the problem\nand the preconditioning.\n\nPreliminaries: preconditioning and KRR. We begin by recalling the basic idea behind preconditioning.\nThe key quantity is the condition number, that for a linear system is the ratio between the largest\nand smallest singular values of the matrix defining the problem [25]. For example, for problem (5)\nthe condition number is given by\n\n$$\\mathrm{cond}(K_{nn} + \\lambda n I) = (\\sigma_{\\max} + \\lambda n)/(\\sigma_{\\min} + \\lambda n),$$\n\nwith $\\sigma_{\\max}, \\sigma_{\\min}$ the largest and smallest eigenvalues of $K_{nn}$, respectively. The importance of the condition\nnumber is that it captures the time complexity of iteratively solving the corresponding linear system.\nFor example, if a simple gradient descent (6) is used, the number of iterations needed for an $\\epsilon$-accurate\nsolution of problem (5) is\n\n$$t = O(\\mathrm{cond}(K_{nn} + \\lambda n I) \\log(1/\\epsilon)).$$\n\nIt is shown in [23] that in this case $t = \\sqrt{n} \\log n$ iterations are needed to achieve a solution with good statistical\nproperties. Indeed, it can be shown that roughly $t \\approx (1/\\lambda) \\log(1/\\epsilon)$ iterations are needed, where $\\lambda = 1/\\sqrt{n}$ and\n$\\epsilon = 1/n$. The idea behind preconditioning is to use a suitable matrix $B$ to define an equivalent linear\nsystem with better condition number. For (5), an ideal choice is $B$ such that\n\n$$BB^\\top = (K_{nn} + \\lambda n I)^{-1} \\qquad (9)$$\n\nand $B^\\top (K_{nn} + \\lambda n I) B \\, \\beta = B^\\top \\hat{y}$. Clearly, if $\\beta_*$ solves the latter problem, $\\alpha_* = B\\beta_*$ is a solution\nof problem (5). Using a preconditioner $B$ as in (9) one iteration is sufficient, but computing $B$ is\ntypically as hard as the original problem. 
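\n\nTo see why one iteration suffices with the ideal choice (9), here is a short derivation. It assumes $B$ is a square matrix, which is then invertible because $BB^\\top$ is. Since $K_{nn} + \\lambda n I = (BB^\\top)^{-1} = B^{-\\top} B^{-1}$,\n\n$$B^\\top (K_{nn} + \\lambda n I) B = B^\\top B^{-\\top} B^{-1} B = I, \\qquad \\beta_* = B^\\top \\hat{y}, \\qquad \\alpha_* = B\\beta_* = BB^\\top \\hat{y} = (K_{nn} + \\lambda n I)^{-1} \\hat{y},$$\n\nso the preconditioned system has condition number 1 and its solution recovers that of (5).\n\n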
The problem is to derive a preconditioner such that (9) might\nhold only approximately, but that can be computed efficiently. The derivation of efficient preconditioners\nfor the exact KRR problem (5) has been the subject of recent studies [3, 4, 26, 5, 6]. In particular,\n[4, 26, 5, 6] consider random projections to approximately compute a preconditioner. Clearly,\nwhile preconditioning (5) leads to computational speed-ups in terms of the number of iterations,\nrequirements in terms of memory/kernel evaluations are the same as standard kernel ridge regression.\nThe key idea to tackle this problem is to consider an efficient preconditioning approach for problem (8)\nrather than (5).\nBasic FALKON algorithm. We begin by illustrating a basic version of our approach. The key\ningredient is the following preconditioner for Eq. (8),\n\n$$BB^\\top = \\left( \\frac{n}{M} K_{MM}^2 + \\lambda n K_{MM} \\right)^{-1}, \\qquad (10)$$\n\nwhich is itself based on a Nystr\u00f6m approximation (footnote 3). The above preconditioning is a natural approximation\nof the ideal preconditioning of problem (8), that is $BB^\\top = (K_{nM}^\\top K_{nM} + \\lambda n K_{MM})^{-1}$, and\nreduces to it if $M = n$. Our theoretical analysis shows that $M \\ll n$ suffices for deriving optimal\nstatistical rates. In its basic form FALKON is derived by combining the above preconditioning and\ngradient descent,\n\n$$\\hat{f}_{\\lambda,M,t}(x) = \\sum_{i=1}^M \\alpha_{t,i} K(x, \\tilde{x}_i), \\quad \\text{with } \\alpha_t = B\\beta_t \\quad \\text{and} \\qquad (11)$$\n\n$$\\beta_k = \\beta_{k-1} - \\frac{\\tau}{n} B^\\top \\left[ K_{nM}^\\top (K_{nM}(B\\beta_{k-1}) - \\hat{y}) + \\lambda n K_{MM}(B\\beta_{k-1}) \\right], \\qquad (12)$$\n\nfor $t \\in \\mathbb{N}$, $\\beta_0 = 0$ and $1 \\le k \\le t$ and a suitably chosen $\\tau$. In practice, a refined version of FALKON\nis preferable where a faster gradient iteration is used and additional care is taken in organizing\ncomputations.\nFALKON. The actual version of FALKON we propose is Alg. 1 (see Sect. A, Alg. 2 for the complete\nalgorithm). It consists in solving the system $B^\\top H B \\beta = B^\\top z$ via conjugate gradient [25], since it is\na fast gradient method and does not require specifying the step-size. Moreover, to compute $B$ quickly,\nwith reduced numerical errors, we consider the following strategy\n\n$$B = \\frac{1}{\\sqrt{n}} T^{-1} A^{-1}, \\qquad T = \\mathrm{chol}(K_{MM}), \\qquad A = \\mathrm{chol}\\left( \\frac{1}{M} T T^\\top + \\lambda I \\right), \\qquad (13)$$\n\nwhere chol() is the Cholesky decomposition (see Sect. A for the strategy for non-invertible $K_{MM}$).\n\nFootnote 3: For the sake of simplicity, here we assume $K_{MM}$ to be invertible and the Nystr\u00f6m centers selected with\nuniform sampling from the training set, see Sect. A and Alg. 2 in the appendix for the general algorithm.\n\nAlgorithm 1 MATLAB code for FALKON. It requires $O(nMt + M^3)$ in time and $O(M^2)$ in memory.\nSee Sect. A and Alg. 2 in the appendixes for the complete algorithm.\nInput: Dataset $X = (x_i)_{i=1}^n \\in \\mathbb{R}^{n \\times D}$, $\\hat{y} = (y_i)_{i=1}^n \\in \\mathbb{R}^n$, centers $C = (\\tilde{x}_j)_{j=1}^M \\in \\mathbb{R}^{M \\times D}$, KernelMatrix\ncomputing the kernel matrix given two sets of points, regularization parameter $\\lambda$, number of iterations $t$.\nOutput: Nystr\u00f6m coefficients $\\alpha$.\n\nfunction alpha = FALKON(X, C, Y, KernelMatrix, lambda, t)\n\n    n = size(X,1); M = size(C,1); KMM = KernelMatrix(C,C);\n    % Upper-triangular Cholesky factors of Eq. (13); eps*M*eye(M) is a small jitter\n    T = chol(KMM + eps*M*eye(M));\n    A = chol(T*T'/M + lambda*eye(M));\n\n    % Computes KnM'*(KnM*u + v) in row blocks, so KnM is never stored in full\n    function w = KnM_times_vector(u, v)\n        w = zeros(M,1); ms = ceil(linspace(0, n, ceil(n/M)+1));\n        for i=1:ceil(n/M)\n            Kr = KernelMatrix( X(ms(i)+1:ms(i+1),:), C );\n            w = w + Kr'*(Kr*u + v(ms(i)+1:ms(i+1),:));\n        end\n    end\n\n    % Preconditioned operator B'*H*B and right-hand side\n    BHB = @(u) A'\\(T'\\(KnM_times_vector(T\\(A\\u), zeros(n,1))/n) + lambda*(A\\u));\n    r = A'\\(T'\\KnM_times_vector(zeros(M,1), Y/n));\n    % conjgrad is a conjugate gradient solver that is not defined in this listing\n    alpha = T\\(A\\conjgrad(BHB, r, t));\n\nend\n\nComputations. In Alg. 
1, $B$ is never built explicitly and $A$, $T$ are two upper-triangular matrices, so\ncomputing $A^{-\\top}u$ or $A^{-1}u$ for a vector $u$ costs $M^2$, and the same holds for $T$. The cost of computing the preconditioner\nis only $\\frac{4}{3} M^3$ floating point operations (consisting in two Cholesky decompositions and one product\nof two triangular matrices). Then FALKON requires $O(nMt + M^3)$ in time and the same $O(M^2)$\nmemory requirement of the basic Nystr\u00f6m method, if matrix/vector multiplications at each iteration\nare performed in blocks. This implies $O(nMt)$ kernel evaluations are needed.\nThe question remains to characterize $M$ and the number of iterations needed for good statistical\naccuracy. Indeed, in the next section we show that roughly $O(n\\sqrt{n})$ computations and $O(n)$ memory\nare sufficient for optimal accuracy. This implies that FALKON is currently the most efficient kernel\nmethod with the same optimal statistical accuracy of KRR, see Table 1.\n\n4 Theoretical Analysis\n\nIn this section, we characterize the generalization properties of FALKON showing it achieves the\noptimal generalization error of KRR, with dramatically reduced computations. This result is given in\nThm. 3 and derived in two steps. First, we study the difference between the excess risk of FALKON\nand that of the basic Nystr\u00f6m (8), showing it depends on the condition number induced by the\npreconditioning, hence on $M$ (see Thm. 1). Deriving these results requires some care, since differently\nfrom standard optimization results, our goal is to solve (1), i.e. achieve small excess risk, not to minimize\nthe empirical error. Second, we show that choosing $M = \\tilde{O}(1/\\lambda)$ makes this difference as\nsmall as $e^{-t/2}$ (see Thm. 2). 
Finally, recalling that the basic Nystr\u00f6m for $\\lambda = 1/\\sqrt{n}$ has essentially\nthe same statistical properties of KRR [13], we answer the question posed at the end of the last\nsection and show that roughly $\\log n$ iterations are sufficient for optimal statistical accuracy. Following\nthe discussion in the previous section this means that the computational requirements for optimal\naccuracy are $\\tilde{O}(n\\sqrt{n})$ in time/kernel evaluations and $\\tilde{O}(n)$ in space. Later in this section faster rates\nunder further regularity assumptions are also derived and the effect of different selection methods for\nthe Nystr\u00f6m centers is considered. The proofs for this section are provided in Sect. E of the appendixes.\n\n4.1 Main Result\n\nThe first result is interesting in its own right since it corresponds to translating optimization guarantees\ninto statistical results. In particular, we derive a relation between the excess risk of the FALKON algorithm\n$\\hat{f}_{\\lambda,M,t}$ from Alg. 1 and the Nystr\u00f6m estimator $\\tilde{f}_{\\lambda,M}$ from Eq. (8) with uniform sampling.\n\nAlgorithm | train time | kernel evaluations | memory | test time\nSVM / KRR + direct method | $n^3$ | $n^2$ | $n^2$ | $n$\nKRR + iterative [1, 2] | $n^2 \\sqrt[4]{n}$ | $n^2$ | $n^2$ | $n$\nDoubly stochastic [22] | $n^2 \\sqrt{n}$ | $n^2 \\sqrt{n}$ | $n$ | $n$\nPegasos / KRR + sgd [27] | $n^2$ | $n^2$ | $n$ | $n$\nKRR + iter + precond [3, 28, 4, 5, 6] | $n^2$ | $n^2$ | $n$ | $n$\nDivide & Conquer [29] | $n^2$ | $n\\sqrt{n}$ | $n$ | $n$\nNystr\u00f6m, random features [7, 8, 9] | $n^2$ | $n\\sqrt{n}$ | $n$ | $\\sqrt{n}$\nNystr\u00f6m + iterative [23, 24] | $n^2$ | $n\\sqrt{n}$ | $n$ | $\\sqrt{n}$\nNystr\u00f6m + sgd [20] | $n^2$ | $n\\sqrt{n}$ | $n$ | $\\sqrt{n}$\nFALKON (see Thm. 3) | $n\\sqrt{n}$ | $n\\sqrt{n}$ | $n$ | $\\sqrt{n}$\n\nTable 1: Computational complexity required by different algorithms, for optimal generalization.\nLogarithmic terms are not shown.\nTheorem 1. Let $n, M \\ge 3$, $t \\in \\mathbb{N}$, $0 < \\lambda \\le \\lambda_1$ and $\\delta \\in (0, 1]$. 
Assume there exists $\\kappa \\ge 1$ such that\n$K(x, x) \\le \\kappa^2$ for any $x \\in X$. Then, the following inequality holds with probability $1 - \\delta$\n\n$$\\mathcal{R}(\\hat{f}_{\\lambda,M,t})^{1/2} \\le \\mathcal{R}(\\tilde{f}_{\\lambda,M})^{1/2} + 4 \\hat{v} \\, e^{-\\nu t} \\sqrt{1 + \\frac{9\\kappa^2}{\\lambda n} \\log \\frac{n}{\\delta}},$$\n\nwhere $\\hat{v}^2 = \\frac{1}{n} \\sum_{i=1}^n y_i^2$ and $\\nu = \\log(1 + 2/(\\mathrm{cond}(B^\\top H B)^{1/2} - 1))$, with $\\mathrm{cond}(B^\\top H B)$ the\ncondition number of $B^\\top H B$. Note that $\\lambda_1 > 0$ is a constant not depending on $\\lambda, n, M, \\delta, t$.\nThe additive term in the bound above decreases exponentially in the number of iterations. If the\ncondition number of $B^\\top H B$ is smaller than a small universal constant (e.g. 17), then $\\nu > 1/2$ and\nthe additive term decreases as $e^{-t/2}$. The next theorem derives a condition on $M$ that allows us to control\n$\\mathrm{cond}(B^\\top H B)$ and to derive such an exponential decay.\nTheorem 2. Under the same conditions of Thm. 1, if\n\n$$M \\ge 5 \\left[ 1 + \\frac{14\\kappa^2}{\\lambda} \\right] \\log \\frac{8\\kappa^2}{\\lambda\\delta},$$\n\nthen the exponent $\\nu$ in Thm. 1 satisfies $\\nu \\ge 1/2$.\nThe above result gives the desired exponential bound showing that after $\\log n$ iterations the excess\nrisk of FALKON is controlled by that of the basic Nystr\u00f6m, more precisely\n\n$$\\mathcal{R}(\\hat{f}_{\\lambda,M,t}) \\le 2 \\mathcal{R}(\\tilde{f}_{\\lambda,M}) \\quad \\text{when} \\quad t \\ge \\log \\mathcal{R}(\\tilde{f}_{\\lambda,M}) + \\log\\left( 1 + \\frac{9\\kappa^2}{\\lambda n} \\log \\frac{n}{\\delta} \\right) + \\log(16 \\hat{v}^2).$$\n\nFinally, we derive an excess risk bound for FALKON. By the no-free-lunch theorem, this requires\nsome conditions on the learning problem. We first consider a standard basic setting where we only\nassume there exists $f_{\\mathcal{H}} \\in \\mathcal{H}$ such that $\\mathcal{E}(f_{\\mathcal{H}}) = \\inf_{f \\in \\mathcal{H}} \\mathcal{E}(f)$.\nTheorem 3. Let $\\delta \\in (0, 1]$. 
Assume there exists $\\kappa \\ge 1$ such that $K(x, x) \\le \\kappa^2$ for any $x \\in X$, and\n$y \\in [-\\frac{a}{2}, \\frac{a}{2}]$, almost surely, $a > 0$. There exists $n_0 \\in \\mathbb{N}$ such that for any $n \\ge n_0$, if\n\n$$\\lambda = \\frac{1}{\\sqrt{n}}, \\qquad M \\ge 75 \\sqrt{n} \\log \\frac{48\\kappa^2 n}{\\delta}, \\qquad t \\ge \\frac{1}{2} \\log(n) + 5 + 2\\log(a + 3\\kappa),$$\n\nthen with probability $1 - \\delta$,\n\n$$\\mathcal{R}(\\hat{f}_{\\lambda,M,t}) \\le \\frac{c_0 \\log^2 \\frac{24}{\\delta}}{\\sqrt{n}}.$$\n\nIn particular $n_0, c_0$ do not depend on $\\lambda, M, n, t$ and $c_0$ does not depend on $\\delta$.\n\nThe above result provides the desired bound, and all the constants are given in the appendix. The\nobtained learning rate is the same as the full KRR estimator and is known to be optimal in a minimax\nsense [17], hence not improvable. As mentioned before, the same bound is also achieved by the\nbasic Nystr\u00f6m method but with much worse time complexity. Indeed, as discussed before, using\na simple iterative solver typically requires $O(\\sqrt{n} \\log n)$ iterations, while we need only $O(\\log n)$.\nConsidering the choice for $M$ this leads to a computational time of $O(nMt) = O(n\\sqrt{n})$ for optimal\ngeneralization (omitting logarithmic terms). To the best of our knowledge FALKON currently\nprovides the best time/space complexity to achieve the statistical accuracy of KRR. Beyond the\nbasic setting considered above, in the next section we show that FALKON can achieve much faster\nrates under refined regularity assumptions and also consider the potential benefits of leverage score\nsampling.\n\n4.2 Fast learning rates and Nystr\u00f6m with approximate leverage scores\n\nConsidering fast rates and Nystr\u00f6m with more general sampling is considerably more technical and\na heavier notation is needed. Our analysis applies to any approximation scheme (e.g. [30, 12, 31])\nsatisfying the definition of $q$-approximate leverage scores [13], that is, satisfying $q^{-1} l_i(\\lambda) \\le \\hat{l}_i(\\lambda) \\le q \\, l_i(\\lambda)$,\n$\\forall i \\in \\{1, \\dots, n\\}$. Here $\\lambda > 0$, $l_i(\\lambda) = (K_{nn}(K_{nn} + \\lambda n I)^{-1})_{ii}$ are the leverage scores\nand $q \\ge 1$ controls the quality of the approximation. In particular, given $\\lambda$, the Nystr\u00f6m points are\nsampled independently from the dataset with probability $p_i \\propto \\hat{l}_i(\\lambda)$. We need a few more definitions.\nLet $K_x = K(x, \\cdot)$ for any $x \\in X$ and $\\mathcal{H}$ the reproducing kernel Hilbert space [32] of functions with\ninner product defined by $\\mathcal{H} = \\mathrm{span}\\{K_x \\mid x \\in X\\}$ and closed with respect to the inner product $\\langle \\cdot, \\cdot \\rangle_{\\mathcal{H}}$\ndefined by $\\langle K_x, K_{x'} \\rangle_{\\mathcal{H}} = K(x, x')$, for all $x, x' \\in X$. Define $C : \\mathcal{H} \\to \\mathcal{H}$ to be the linear operator\n$\\langle f, Cg \\rangle_{\\mathcal{H}} = \\int_X f(x) g(x) \\, d\\rho_X(x)$, for all $f, g \\in \\mathcal{H}$. Finally define the following quantities,\n\n$$\\mathcal{N}_\\infty(\\lambda) = \\sup_{x \\in X} \\|(C + \\lambda I)^{-1/2} K_x\\|_{\\mathcal{H}}, \\qquad \\mathcal{N}(\\lambda) = \\mathrm{Tr}(C(C + \\lambda I)^{-1}).$$\n\nThe latter quantity is known as degrees of freedom or effective dimension, and can be seen as a measure\nof the size of $\\mathcal{H}$. The quantity $\\mathcal{N}_\\infty(\\lambda)$ can be seen to provide a uniform bound on the leverage scores.\nIn particular note that $\\mathcal{N}(\\lambda) \\le \\mathcal{N}_\\infty(\\lambda) \\le \\frac{\\kappa^2}{\\lambda}$ [13]. We can now provide a refined version of Thm. 2.\nTheorem 4. Under the same conditions of Thm. 1, the exponent $\\nu$ in Thm. 1 satisfies $\\nu \\ge 1/2$, when\n\n1. either Nystr\u00f6m uniform sampling is used with $M \\ge 70 \\left[ 1 + \\mathcal{N}_\\infty(\\lambda) \\right] \\log \\frac{8\\kappa^2}{\\lambda\\delta}$,\n2. or Nystr\u00f6m q-approx. lev. scores [13] is used, with $\\lambda \\ge \\frac{19\\kappa^2}{n} \\log \\frac{n}{2\\delta}$, $n \\ge 405\\kappa^2 \\log \\frac{12\\kappa^2}{\\delta}$, and\n$M \\ge 215 \\left[ 2 + q^2 \\mathcal{N}(\\lambda) \\right] \\log \\frac{8\\kappa^2}{\\lambda\\delta}$.\n\nWe then recall the standard, albeit technical, assumptions leading to fast rates [17, 18]. 
The capacity\ncondition requires the existence of $\\gamma \\in (0, 1]$ and $Q \\ge 0$, such that $\\mathcal{N}(\\lambda) \\le Q^2 \\lambda^{-\\gamma}$. Note that this\ncondition is always satisfied with $Q = \\kappa$ and $\\gamma = 1$. The source condition requires the existence\nof $r \\in [1/2, 1]$ and $g \\in \\mathcal{H}$, such that $f_{\\mathcal{H}} = C^{r-1/2} g$. Intuitively, the capacity condition measures\nthe size of $\\mathcal{H}$: if $\\gamma$ is small then $\\mathcal{H}$ is small and rates are faster. The source condition measures the\nregularity of $f_{\\mathcal{H}}$: if $r$ is big, $f_{\\mathcal{H}}$ is regular and rates are faster. The case $r = 1/2$ and $\\gamma = D/(2s)$ (for\na kernel with smoothness $s$ and input space $\\mathbb{R}^D$) recovers the classic Sobolev condition. For further\ndiscussions on the interpretation of the conditions above see [17, 18, 11, 13]. We can then state our\nmain result on fast rates.\nTheorem 5. Let $\\delta \\in (0, 1]$. Assume there exists $\\kappa \\ge 1$ such that $K(x, x) \\le \\kappa^2$ for any $x \\in X$,\nand $y \\in [-\\frac{a}{2}, \\frac{a}{2}]$, almost surely, with $a > 0$. There exists an $n_0 \\in \\mathbb{N}$ such that for any $n \\ge n_0$ the\nfollowing holds. When\n\n$$\\lambda = n^{-\\frac{1}{2r+\\gamma}}, \\qquad t \\ge \\log(n) + 5 + 2\\log(a + 3\\kappa^2),$$\n\n1. and either Nystr\u00f6m uniform sampling is used with $M \\ge 70 \\left[ 1 + \\mathcal{N}_\\infty(\\lambda) \\right] \\log \\frac{8\\kappa^2}{\\lambda\\delta}$,\n2. or Nystr\u00f6m q-approx. lev. scores [13] is used with $M \\ge 220 \\left[ 2 + q^2 \\mathcal{N}(\\lambda) \\right] \\log \\frac{8\\kappa^2}{\\lambda\\delta}$,\n\nthen with probability $1 - \\delta$,\n\n$$\\mathcal{R}(\\hat{f}_{\\lambda,M,t}) \\le c_0 \\log^2 \\frac{24}{\\delta} \\; n^{-\\frac{2r}{2r+\\gamma}},$$\n\nwhere $\\hat{f}_{\\lambda,M,t}$ is the FALKON estimator (Sect. 3, Alg. 1 and Sect. A, Alg. 2 in the appendix for the\ncomplete version). In particular $n_0, c_0$ do not depend on $\\lambda, M, n, t$ and $c_0$ does not depend on $\\delta$.\n\nFigure 1: FALKON is compared to stochastic gradient, gradient descent and conjugate gradient applied\nto Problem (8), while NYTRO refers to the variants described in [23]. 
The graph shows the test error\non the HIGGS dataset ($1.1 \\times 10^7$ examples) with respect to the number of iterations (epochs for\nstochastic algorithms).\n\nThe above result shows that FALKON achieves the same fast rates as KRR, under the same conditions\n[17]. For $r = 1/2$, $\\gamma = 1$, the rate in Thm. 3 is recovered. If $\\gamma < 1$, $r > 1/2$, FALKON achieves a\nrate close to $O(1/n)$. By selecting the Nystr\u00f6m points with uniform sampling, a bigger $M$ could be\nneeded for fast rates (albeit always less than $n$). However, when approximate leverage scores are used,\n$M$ smaller than $n^{\\gamma/2} \\ll \\sqrt{n}$ is always enough for optimal generalization. This shows that FALKON\nwith approximate leverage scores is the first algorithm to achieve fast rates with a computational\ncomplexity that is $O(n\\mathcal{N}(\\lambda)) = O(n^{1 + \\frac{\\gamma}{2r+\\gamma}}) \\le O(n^{1 + \\frac{\\gamma}{2}})$ in time.\n\n5 Experiments\n\nWe present FALKON's performance on a range of large scale datasets. As shown in Tables 2 and 3,\nFALKON achieves state of the art accuracy and typically outperforms previous approaches in all the\nconsidered large scale datasets including IMAGENET. This is remarkable considering FALKON\nrequired only a fraction of the competitors' computational resources. Indeed we used a single machine\nequipped with two Intel Xeon E5-2630 v3, one NVIDIA Tesla K40c and 128 GB of RAM and a\nbasic MATLAB FALKON implementation, while typically the results for competing algorithms have\nbeen obtained on clusters of GPU workstations (accuracies, times and used architectures are cited\nfrom the corresponding papers).\nA minimal MATLAB implementation of FALKON is presented in Appendix G. The code necessary\nto reproduce the following experiments, plus a FALKON version that is able to use the GPU, is\navailable on GitHub at https://github.com/LCSL/FALKON_paper . 
The error is measured with\nMSE, RMSE or relative error for regression problems, and with classification error (c-err) or AUC\nfor the classification problems, to be consistent with the literature. For datasets which do not have a\nfixed test set, we set apart 20% of the data for testing. For all datasets except YELP and IMAGENET,\nwe normalize the features by their z-score. From now on we denote with n the cardinality of the\ndataset and with d the dimensionality. A comparison of FALKON with other methods to compute\nthe Nystr\u00f6m estimator, in terms of the MSE test error on the HIGGS dataset, is given in Figure 1.\nMillionSongs [36] (Table 2, n = 4.6 \u00d7 10^5, d = 90, regression). We used a Gaussian kernel with\n\u03c3 = 6, \u03bb = 10^{\u22126} and 10^4 Nystr\u00f6m centers. Moreover, with 5 \u00d7 10^4 centers, FALKON achieves a\n79.20 MSE and 4.49 \u00d7 10^{\u22123} rel. error in 630 sec.\nTIMIT (Table 2, n = 1.2 \u00d7 10^6, d = 440, multiclass classification). We used the same\npreprocessed dataset of [6] and a Gaussian kernel with \u03c3 = 15, \u03bb = 10^{\u22129} and 10^5 Nystr\u00f6m centers.\nYELP (Table 2, n = 1.5 \u00d7 10^6, d = 6.52 \u00d7 10^7, regression). We used the same dataset\nof [24]. We extracted the 3-grams from the plain text with the same pipeline as [24], then we mapped\nthem into a sparse binary vector which records whether the 3-gram is present or not in the example. We used a\nlinear kernel with 5 \u00d7 10^4 Nystr\u00f6m centers. 
With 10^5 centers, we get a RMSE of 0.828 in 50 minutes.\n\nTable 2: Architectures: \u2021 cluster of 128 EC2 r3.2xlarge machines, \u2020 cluster of 8 EC2 r3.8xlarge machines, \u2240\nsingle machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU, 128GB RAM, \u22c6 cluster\nwith IBM POWER8 12-core processor, 512 GB RAM, \u2217 unknown platform.\n\nMethod | MillionSongs MSE | MillionSongs relative error | MillionSongs time (s) | YELP RMSE | YELP time (min) | TIMIT c-err | TIMIT time (h)\nFALKON | 80.10 | 4.51 \u00d7 10^{\u22123} | 55 | 0.833 | 20 | 32.3% | 1.5\nPrec. KRR [4] | - | 4.58 \u00d7 10^{\u22123} | 289\u2020 | - | - | - | -\nHierarchical [33] | - | 4.56 \u00d7 10^{\u22123} | 293\u22c6 | - | - | - | -\nD&C [29] | 80.35 | - | 737\u2217 | - | - | - | -\nRand. Feat. [29] | 80.93 | - | 772\u2217 | - | - | - | -\nNystr\u00f6m [29] | 80.38 | - | 876\u2217 | - | - | - | -\nADMM R. F. [4] | - | 5.01 \u00d7 10^{\u22123} | 958\u2020 | - | - | - | -\nBCD R. F. [24] | - | - | - | 0.949 | 42\u2021 | 34.0% | 1.7\u2021\nBCD Nystr\u00f6m [24] | - | - | - | 0.861 | 60\u2021 | 33.7% | 1.7\u2021\nEigenPro [6] | - | - | - | - | - | 32.6% | 3.9\u2240\nKRR [33] [24] | - | 4.55 \u00d7 10^{\u22123} | - | 0.854 | 500\u2021 | 33.5% | 8.3\u2021\nDeep NN [34] | - | - | - | - | - | 32.4% | -\nSparse Kernels [34] | - | - | - | - | - | 30.9% | -\nEnsemble [35] | - | - | - | - | - | 33.5% | -\n\nTable 3: Architectures: \u2020 cluster with IBM POWER8 12-core cpu, 512 GB RAM, \u2240 single machine\nwith two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU, 128GB RAM, \u2021 single machine [37]\n\nMethod | SUSY c-err | SUSY AUC | SUSY time (min) | HIGGS AUC | HIGGS time (h) | IMAGENET c-err | IMAGENET time (h)\nFALKON | 19.6% | 0.877 | 4 | 0.833 | 3 | 20.7% | 4\nEigenPro [6] | 19.8% | - | 6\u2240 | - | - | - | -\nHierarchical [33] | 20.1% | - | 40\u2020 | - | - | - | -\nBoosted Decision Tree [38] | - | 0.863 | - | 0.810 | - | - | -\nNeural Network [38] | - | 0.875 | - | 0.816 | - | - | -\nDeep Neural Network [38] | - | 0.879 | 4680\u2021 | 0.885 | 78\u2021 | - | -\nInception-V4 [39] | - | - | - | - | - | 20.0% | -\n\nSUSY (Table 3, n = 5 \u00d7 10^6, 
d = 18, binary classification). We used a Gaussian kernel\nwith \u03c3 = 4, \u03bb = 10^{\u22126} and 10^4 Nystr\u00f6m centers.\nHIGGS (Table 3, n = 1.1 \u00d7 10^7, d = 28, binary classification). Each feature has been\nnormalized by subtracting its mean and dividing by its variance. We used a Gaussian kernel with\ndiagonal matrix width learned with cross validation on a small validation set, \u03bb = 10^{\u22128} and 10^5\nNystr\u00f6m centers. If we use a single \u03c3 = 5 we reach an AUC of 0.825.\nIMAGENET (Table 3, n = 1.3 \u00d7 10^6, d = 1536, multiclass classification). We report the\ntop 1 c-err over the validation set of ILSVRC 2012 with a single crop. The features are obtained from\nthe convolutional layers of pre-trained Inception-V4 [39]. We used a Gaussian kernel with \u03c3 = 19,\n\u03bb = 10^{\u22129} and 5 \u00d7 10^4 Nystr\u00f6m centers. Note that with a linear kernel we achieve c-err = 22.2%.\nAcknowledgments.\nThe authors would like to thank Mikhail Belkin, Benjamin Recht, Siyuan Ma, Eric Fosler-Lussier, Shivaram\nVenkataraman and Stephen L. Tu for providing their features of the TIMIT and YELP datasets, and NVIDIA\nCorporation for the donation of the Tesla K40c GPU used for this research. This work is funded by the Air Force\nproject FA9550-17-1-0390 (European Office of Aerospace Research and Development) and by the FIRB project\nRBFR12M3AC (Italian Ministry of Education, University and Research).\n\nReferences\n\n[1] A. Caponnetto and Yuan Yao. Adaptive rates for regularization operators in learning theory. Analysis and Applications, 08, 2010.\n\n[2] L. Lo Gerfo, Lorenzo Rosasco, Francesca Odone, Ernesto De Vito, and Alessandro Verri. Spectral Algorithms for Supervised Learning. Neural Computation, 20(7):1873\u20131897, 2008.\n\n[3] Gregory E Fasshauer and Michael J McCourt. Stable evaluation of Gaussian radial basis function interpolants. SIAM Journal on Scientific Computing, 34(2):A737\u2013A762, 2012.\n\n[4] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Faster kernel ridge regression using sketching and preconditioning. arXiv preprint arXiv:1611.03220, 2016.\n\n[5] Alon Gonen, Francesco Orabona, and Shai Shalev-Shwartz. Solving ridge regression using sketched preconditioned SVRG. arXiv preprint arXiv:1602.02350, 2016.\n\n[6] Siyuan Ma and Mikhail Belkin. Diving into the shallows: a computational perspective on large-scale shallow learning. arXiv preprint arXiv:1703.10622, 2017.\n\n[7] Christopher Williams and Matthias Seeger. Using the Nystr\u00f6m Method to Speed Up Kernel Machines. In NIPS, pages 682\u2013688. MIT Press, 2000.\n\n[8] Alex J. Smola and Bernhard Sch\u00f6lkopf. Sparse Greedy Matrix Approximation for Machine Learning. In ICML, pages 911\u2013918. Morgan Kaufmann, 2000.\n\n[9] Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines. In NIPS, pages 1177\u20131184. Curran Associates, Inc., 2007.\n\n[10] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313\u20131320, 2009.\n\n[11] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In COLT, volume 30 of JMLR Proceedings, pages 185\u2013209. JMLR.org, 2013.\n\n[12] Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28, pages 775\u2013783. 2015.\n\n[13] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nystr\u00f6m computational regularization. In Advances in Neural Information Processing Systems, pages 1648\u20131656, 2015.\n\n[14] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. arXiv preprint arXiv:1602.04474, 2016.\n\n[15] Francis Bach. 
On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1\u201338, 2017.\n\n[16] Bernhard Sch\u00f6lkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, 2002.\n\n[17] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331\u2013368, 2007.\n\n[18] Ingo Steinwart, Don R Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In COLT, 2009.\n\n[19] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52\u201372, 2007.\n\n[20] Aymeric Dieuleveut and Francis Bach. Non-parametric stochastic approximation with large step sizes. arXiv preprint arXiv:1408.0361, 2014.\n\n[21] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289\u2013315, 2007.\n\n[22] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041\u20133049, 2014.\n\n[23] Raffaello Camoriano, Tom\u00e1s Angles, Alessandro Rudi, and Lorenzo Rosasco. Nytro: When subsampling meets early stopping. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1403\u20131411, 2016.\n\n[24] Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, and Benjamin Recht. Large scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310, 2016.\n\n[25] Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2003.\n\n[26] Kurt Cutajar, Michael Osborne, John Cunningham, and Maurizio Filippone. 
Preconditioning kernel matrices. In International Conference on Machine Learning, pages 2529\u20132538, 2016.\n\n[27] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3\u201330, 2011.\n\n[28] Yun Yang, Mert Pilanci, and Martin J Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. arXiv preprint arXiv:1501.06195, 2015.\n\n[29] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Divide and Conquer Kernel Ridge Regression. In COLT, volume 30 of JMLR Proceedings, pages 592\u2013617. JMLR.org, 2013.\n\n[30] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. JMLR, 13:3475\u20133506, 2012.\n\n[31] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform Sampling for Matrix Approximation. In ITCS, pages 181\u2013190. ACM, 2015.\n\n[32] I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer New York, 2008.\n\n[33] Jie Chen, Haim Avron, and Vikas Sindhwani. Hierarchically compositional kernels for scalable nonparametric learning. CoRR, abs/1608.00860, 2016.\n\n[34] Avner May, Alireza Bagheri Garakani, Zhiyun Lu, Dong Guo, Kuan Liu, Aurelien Bellet, Linxi Fan, Michael Collins, Daniel J. Hsu, Brian Kingsbury, Michael Picheny, and Fei Sha. Kernel approximation methods for speech recognition. CoRR, abs/1701.03577, 2017.\n\n[35] Po-Sen Huang, Haim Avron, Tara N. Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 205\u2013209, 2014.\n\n[36] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. 
The million song dataset. In ISMIR, 2011.\n\n[37] Alexandre Alves. Stacking machine learning classifiers to identify Higgs bosons at the LHC. CoRR, abs/1612.07725, 2016.\n\n[38] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.\n\n[39] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. pages 4278\u20134284, 2017.\n\n[40] Michael Reed and Barry Simon. Methods of Modern Mathematical Physics: Vol.: 1.: Functional Analysis. Academic Press, 1980.\n\n[41] Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, and Francesca Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, pages 883\u2013904, 2005.\n\n[42] Alessandro Rudi, Guillermo D Canas, and Lorenzo Rosasco. On the Sample Complexity of Subspace Learning. In NIPS, pages 2067\u20132075, 2013.\n\n[43] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Olivier Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning. 2004.\n", "award": [], "sourceid": 2109, "authors": [{"given_name": "Alessandro", "family_name": "Rudi", "institution": "INRIA"}, {"given_name": "Luigi", "family_name": "Carratino", "institution": "University of Genoa"}, {"given_name": "Lorenzo", "family_name": "Rosasco", "institution": "University of Genova- MIT - IIT"}]}