{"title": "Diving into the shallows: a computational perspective on large-scale shallow learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3778, "page_last": 3787, "abstract": "The remarkable recent success of deep neural networks has not been easy to analyze theoretically. It has been particularly hard to disentangle the relative significance of architecture and optimization in achieving accurate classification on large datasets. On the flip side, shallow methods (such as kernel methods) have encountered obstacles in scaling to large data, despite excellent performance on smaller datasets and extensive theoretical analysis. Practical methods, such as the variants of gradient descent used so successfully in deep learning, seem to perform below par when applied to kernel methods. This difficulty has sometimes been attributed to the limitations of shallow architectures. In this paper we identify a basic limitation of gradient descent-based optimization methods when used in conjunction with smooth kernels. Our analysis demonstrates that only a vanishingly small fraction of the function space is reachable after a polynomial number of gradient descent iterations. This drastically limits the approximating power of gradient descent, leading to over-regularization. The issue is purely algorithmic, persisting even in the limit of infinite data. To address this shortcoming in practice, we introduce EigenPro iteration, a simple and direct preconditioning scheme using a small number of approximately computed eigenvectors. It can also be viewed as learning a kernel optimized for gradient descent. Injecting this small, computationally inexpensive, SGD-compatible amount of approximate second-order information leads to major improvements in convergence. For large data, this leads to a significant performance boost over state-of-the-art kernel methods.
In particular, we are able to match or improve the results reported in the literature at a small fraction of their computational budget. For the complete version of this paper see https://arxiv.org/abs/1703.10622.", "full_text": "Diving into the shallows: a computational perspective on large-scale shallow learning

Siyuan Ma, Mikhail Belkin
Department of Computer Science and Engineering, The Ohio State University
{masi, mbelkin}@cse.ohio-state.edu

Abstract

The remarkable recent success of deep neural networks has not been easy to analyze theoretically. It has been particularly hard to disentangle the relative significance of architecture and optimization in achieving accurate classification on large datasets. On the flip side, shallow methods (such as kernel methods) have encountered obstacles in scaling to large data, despite excellent performance on smaller datasets and extensive theoretical analysis. Practical methods, such as the variants of gradient descent used so successfully in deep learning, seem to perform below par when applied to kernel methods. This difficulty has sometimes been attributed to the limitations of shallow architectures.

In this paper we identify a basic limitation of gradient descent-based optimization methods when used in conjunction with smooth kernels. Our analysis demonstrates that only a vanishingly small fraction of the function space is reachable after a polynomial number of gradient descent iterations. This drastically limits the approximating power of gradient descent, leading to over-regularization. The issue is purely algorithmic, persisting even in the limit of infinite data.

To address this shortcoming in practice, we introduce EigenPro iteration, a simple and direct preconditioning scheme using a small number of approximately computed eigenvectors. It can also be viewed as learning a kernel optimized for gradient descent. Injecting this small, computationally inexpensive, SGD-compatible amount of approximate second-order information leads to major improvements in convergence. For large data, this leads to a significant performance boost over state-of-the-art kernel methods. In particular, we are able to match or improve the results reported in the literature at a small fraction of their computational budget. For the complete version of this paper see https://arxiv.org/abs/1703.10622.

1 Introduction

In recent years we have witnessed remarkable advances in many areas of artificial intelligence. Much of this progress has been due to machine learning methods, notably deep neural networks, applied to very large datasets. These networks are typically trained using variants of stochastic gradient descent (SGD), allowing training on large data with modern GPU hardware. Despite intense recent research and significant progress on SGD and deep architectures, it has not been easy to understand the underlying causes of that success. Broadly speaking, it can be attributed to (a) the structure of the function space represented by the network or (b) the properties of the optimization algorithms used. While these two aspects of learning are intertwined, they are distinct and may be disentangled.

As learning in deep neural networks is still largely resistant to theoretical analysis, progress can be made by exploring the limits of shallow methods on large datasets. Shallow methods, such as kernel methods, are the subject of an extensive and diverse literature, both theoretical and practical. In particular, kernel machines are universal learners, capable of learning nearly arbitrary functions given a sufficient number of examples [STC04, SC08].
Still, while kernel methods are easily implementable and show state-of-the-art performance on smaller datasets (see [CK11, HAS+14, DXH+14, LML+14, MGL+17] for some comparisons with DNNs), there has been significantly less progress in applying these methods to large modern data. The goal of this work is to make a step toward understanding the subtle interplay between architecture and optimization, and to take practical steps to improve the performance of kernel methods on large data.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The paper consists of two main parts. First, we identify a basic underlying limitation in using gradient descent-based methods in conjunction with the smooth (infinitely differentiable) kernels typically used in machine learning, showing that only very smooth functions can be approximated after polynomially many steps of gradient descent. This phenomenon is a result of the fast spectral decay of smooth kernels and can be readily understood in terms of the spectral structure of the gradient descent operator in the least squares regression/classification setting, which is the focus of our discussion. Slow convergence leads to severe over-regularization (over-smoothing) and suboptimal approximation for less smooth functions, which are arguably very common in practice, at least in the classification setting, where we expect fast transitions near class boundaries.

This shortcoming of gradient descent is purely algorithmic and is not related to the sample complexity of the data. Nor is it an intrinsic flaw of the kernel architecture, which is capable of approximating arbitrary functions, albeit potentially requiring a very large number of gradient descent steps. The issue is particularly serious for large data, where direct second-order methods cannot be used due to computational constraints. While many approximate second-order methods are available, they rely on low-rank approximations and, as we discuss below, lead to over-regularization (approximation bias).

In the second part of the paper we propose EigenPro iteration (see http://www.github.com/EigenPro for the code), a direct and simple method to alleviate the slow convergence that results from the fast eigen-decay of kernel (and covariance) matrices. EigenPro is a preconditioning scheme based on approximately computing a small number of top eigenvectors to modify the spectrum of these matrices. It can also be viewed as constructing a new kernel, specifically optimized for gradient descent. While EigenPro uses approximate second-order information, that information is only employed to modify first-order gradient descent, leading to the same mathematical solution as gradient descent (without introducing a bias). EigenPro is also fully compatible with SGD, using a low-rank preconditioner with low overhead per iteration. We analyze the step size in the SGD setting and provide a range of experimental results for different kernels and parameter settings, showing five- to thirty-fold acceleration over standard methods such as Pegasos [SSSSC11]. For large data, when the computational budget is limited, that acceleration translates into significantly improved accuracy. In particular, we are able to improve or match the state-of-the-art results reported for large datasets in the kernel literature with only a small fraction of their computational budget.

2 Gradient descent for shallow methods

Shallow methods. In the context of this paper, shallow methods denote the family of algorithms consisting of a (linear or non-linear) feature map φ : R^N → H into a (finite- or infinite-dimensional) Hilbert space H, followed by a linear regression/classification algorithm. This is a simple yet powerful setting, amenable to theoretical analysis.
In particular, it includes the class of kernel methods, where H is a Reproducing Kernel Hilbert Space (RKHS).

Linear regression. Consider n labeled data points {(x1, y1), ..., (xn, yn)} ⊂ H × R. To simplify the notation, assume that the feature map has already been applied to the data, i.e., xi = φ(zi). Least squares linear regression aims to recover the parameter vector α∗ that minimizes the empirical loss, α∗ = arg min_{α ∈ H} L(α), where L(α) = (1/n) Σ_{i=1}^n (⟨α, xi⟩_H − yi)². When α∗ is not uniquely defined, we choose the smallest-norm solution.

Minimizing the empirical loss is related to solving a linear system of equations. Define the data matrix X = (x1, ..., xn)^T and the label vector y = (y1, ..., yn)^T, as well as the (non-centralized) covariance matrix/operator H = (1/n) Σ_{i=1}^n xi xi^T, and rewrite the loss as L(α) = (1/n) ‖Xα − y‖². Since ∇L(α)|_{α=α∗} = 0, minimizing L(α) is equivalent to solving the linear system

Hα − b = 0    (1)

with b = (1/n) X^T y. When d = dim(H) < ∞, the time complexity of solving the linear system in Eq. 1 directly (using Gaussian elimination or other methods typically employed in practice) is O(d³). For kernel methods we frequently have d = ∞. Instead of solving Eq. 1, one solves the dual n × n system Kα − y = 0, where K = [k(zi, zj)]_{i,j=1,...,n} is the kernel matrix. The solution can then be written as f∗(·) = Σ_{i=1}^n αi k(zi, ·). A direct solution would require O(n³) operations.

Gradient descent (GD). While linear systems of equations can be solved by direct methods, such as Gaussian elimination, their computational demands make them impractical for large data.
Gradient descent-type methods, in contrast, require only a (potentially small) number of O(n²) matrix-vector multiplications, a much more manageable task. Moreover, these methods can typically be used in a stochastic setting, reducing computational requirements and allowing for efficient GPU implementations. Such schemes are adopted in popular kernel method implementations such as NORMA [KSW04], SDCA [HCL+08], Pegasos [SSSSC11], and DSGD [DXH+14]. For linear systems of equations, gradient descent takes a simple form known as the Richardson iteration [Ric11]:

α^(t+1) = α^(t) − η(Hα^(t) − b)    (2)

It is easy to see that for convergence of α^(t) to α∗ as t → ∞ we need ‖I − ηH‖ < 1, and hence 0 < η < 2/λ1(H). The explicit formula is

α^(t+1) − α∗ = (I − ηH)^t (α^(1) − α∗)    (3)

We can now describe the computational reach of gradient descent, CRt, i.e., the set of vectors which can be ε-approximated by gradient descent after t steps: CRt(ε) = {v ∈ H s.t. ‖(I − ηH)^t v‖ < ε‖v‖}. It is important to note that any α∗ ∉ CRt(ε) cannot be ε-approximated by gradient descent in fewer than t + 1 iterations. Note that we typically care about the quality of the solution, ‖Hα^(t) − b‖, rather than the error in estimating the parameter vector, ‖α^(t) − α∗‖, which is reflected in the definition. We will assume the initialization α^(1) = 0. Choosing a different starting point does not change the analysis unless second-order information is incorporated in the initialization conditions.

To get a better idea of the space CRt(ε), consider the eigendecomposition of H. Let λ1 ≥ λ2 ≥ ... be its eigenvalues and e1, e2, ... the corresponding eigenvectors/eigenfunctions. We have H = Σ_i λi ei ei^T. Writing Eq. 3 in terms of eigendirections yields α^(t+1) − α∗ = Σ_i (1 − ηλi)^t ⟨ei, α^(1) − α∗⟩ ei. Hence, putting ai = ⟨ei, v⟩ gives CRt(ε) = {v s.t. Σ_i (1 − ηλi)^{2t} ai² < ε² ‖v‖²}. Recalling that η < 2/λ1 and using the fact that (1 − 1/z)^z ≈ 1/e, we see that a necessary condition for v ∈ CRt(ε) is (1/3) Σ_{i: λi < λ1/(2t)} ai² < Σ_i (1 − ηλi)^{2t} ai² < ε² ‖v‖². This is a convenient characterization; we will denote CR′t(ε) = {v s.t. Σ_{i: λi < λ1/(2t)} ai² < 3ε² ‖v‖²} ⊃ CRt(ε). Another convenient but less precise necessary condition for v ∈ CRt(ε) is that |(1 − 2λi/λ1)^t ⟨ei, v⟩| < ε‖v‖. Noting that log(1 − x) < −x and assuming λ1 > 2λi, we obtain

t > λ1 (2λi)^{-1} log(|⟨ei, v⟩| ε^{-1} ‖v‖^{-1})    (4)

The condition number. We are primarily interested in the case when d is infinite or very large and the corresponding operators/matrices are extremely ill-conditioned, with condition number infinite or approaching infinity. In that case, instead of a single condition number one should consider the properties of the eigenvalue decay.

Gradient descent, smoothness and kernel methods. We now proceed to analyze the computational reach for kernel methods. We will start by discussing the case of infinite data (the population case). It is both easier to analyze and allows us to demonstrate the purely computational (non-statistical) nature of the limitations of gradient descent.
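The contraction in Eq. 3 is easy to verify numerically. The following sketch is ours, not code from the paper; it uses NumPy with an arbitrary synthetic exponentially decaying spectrum, runs the Richardson iteration of Eq. 2, and checks that the error along each eigendirection shrinks exactly by the factor (1 − ηλi)^t, so that directions with small λi remain essentially untouched:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic covariance with fast (exponential) eigenvalue decay, mimicking
# a smooth kernel: H = Q diag(lam) Q^T with orthonormal Q.
d = 50
lam = np.exp(-0.5 * np.arange(d))              # 1, e^-0.5, e^-1, ...
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = (Q * lam) @ Q.T

alpha_star = rng.standard_normal(d)            # target parameter vector
b = H @ alpha_star

# Richardson iteration (Eq. 2): alpha <- alpha - eta * (H alpha - b)
eta = 1.0 / lam[0]
alpha = np.zeros(d)
t = 1000
for _ in range(t):
    alpha -= eta * (H @ alpha - b)

# Eq. 3: the error along eigendirection i contracts as (1 - eta*lam_i)^t,
# starting from alpha^(1) = 0.
err = Q.T @ (alpha - alpha_star)
pred = (1 - eta * lam) ** t * (Q.T @ (-alpha_star))
assert np.allclose(err, pred, atol=1e-7)

# Most of alpha_star lies along eigendirections that are still essentially
# untouched after 1000 iterations.
print(np.linalg.norm(err) / np.linalg.norm(alpha_star))
```

With this spectrum only the first dozen or so directions converge in 1000 steps; the printed ratio stays close to 1, illustrating how little of the space is reachable.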
We will see that when the kernel is smooth, the reach of gradient descent is limited to very smooth, at least infinitely differentiable, functions. Moreover, to approximate a function with less smoothness to accuracy ε in the L² norm, one needs a number of gradient descent iterations that is super-polynomial (or even exponential) in 1/ε. Let the data be sampled from a probability distribution with a smooth density μ on a compact domain Ω ⊂ R^p. In the case of infinite data, H becomes the integral operator corresponding to a positive definite kernel k(·,·), Kf(x) = ∫_Ω k(x, z) f(z) dμ(z). This is a compact self-adjoint operator with an infinite positive spectrum λ1, λ2, ..., lim_{i→∞} λi = 0. We have (see the full paper for discussion and references):

Theorem 1. If k is an infinitely differentiable kernel, the rate of eigenvalue decay is super-polynomial, i.e., λi = O(i^{-P}) for all P ∈ N. Moreover, if k is the Gaussian kernel, there exist constants C, C′ > 0 such that for large enough i, λi < C′ exp(−C i^{1/p}).

The computational reach of kernel methods. Consider the eigenfunctions of K, K ei = λi ei, which form an orthonormal basis of L²(Ω). We can write a function f ∈ L²(Ω) as f = Σ_{i=1}^∞ ai ei, with ‖f‖²_{L²} = Σ_{i=1}^∞ ai². We can now describe the reach of kernel methods with a smooth kernel (in the infinite data setting).
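The fast decay in Theorem 1 is easy to observe empirically, since the (normalized) eigenvalues of a kernel matrix on a sample approximate those of the integral operator. A small sketch (ours, not from the paper; the bandwidth, domain, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Eigenvalues of a Gaussian kernel matrix on n points drawn uniformly from
# [0, 1]; dividing by n makes them approximate the eigenvalues of the
# integral operator K for this distribution.
n, s = 1000, 2.0
z = rng.uniform(0.0, 1.0, n)
K = np.exp(-(z[:, None] - z[None, :]) ** 2 / (2 * s**2))
lam = np.linalg.eigvalsh(K)[::-1] / n          # descending order

# The spectrum collapses by orders of magnitude within a handful of indices.
print(lam[:8])
assert lam[5] / lam[0] < 1e-4
```

Even a handful of eigenvalues already span many orders of magnitude, which is exactly the regime in which the reach of gradient descent collapses.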
Specifically, functions which can be approximated in a polynomial number of iterations must have super-polynomial coefficient decay.

Theorem 2. Suppose f ∈ L²(Ω) can be approximated within ε using a number of gradient descent iterations polynomial in 1/ε, i.e., for all ε > 0, f ∈ CR_{ε^{-M}}(ε) for some M ∈ N. Then for any N ∈ N and all i large enough, |ai| < i^{-N}.

Corollary 1. Any f ∈ L²(Ω) which can be ε-approximated with a number of gradient descent steps polynomial in 1/ε is infinitely differentiable. In particular, f must belong to the intersection of all Sobolev spaces on Ω.

Gradient descent for periodic functions on R. Let us now consider a simple but important special case, where the reach can be analyzed very explicitly. Let Ω be the circle with the uniform measure or, equivalently, consider periodic functions on the interval [0, 2π]. Let ks(x, z) be the heat kernel on the circle [Ros97]. This kernel is very close to the Gaussian kernel, ks(x, z) ≈ (1/√(2πs)) exp(−(x − z)²/(4s)). The eigenfunctions ej of the integral operator K corresponding to ks(x, z) are simply the Fourier harmonics sin jx and cos jx. The corresponding eigenvalues are {1, e^{−s}, e^{−s}, e^{−4s}, e^{−4s}, ..., e^{−⌈j/2⌉²s}, ...}. Given a function f on [0, 2π], we can write its Fourier series f = Σ_{j=0}^∞ aj ej. A direct computation shows that for any f ∈ CRt(ε) we have Σ_{j > √(2 ln(2t)/s)} aj² < 3ε² ‖f‖². We see that the space CRt(ε) is "frozen", as √(2 ln(2t)/s) grows extremely slowly as the number of iterations t increases.

As a simple example, consider the Heaviside step function f(x) (on the circle), taking the values 1 and −1 for x ∈ (0, π] and x ∈ (π, 2π], respectively. The step function can be written as f(x) = (4/π) Σ_{j=1,3,5,...} (1/j) sin(jx). From the analysis above, we need O(exp(s/ε²)) iterations of gradient descent to obtain an ε-approximation to this function. It is important to note that the Heaviside step function is a rather natural example, especially in the classification setting, where it represents the simplest two-class classification problem. The situation is not much better for functions with more smoothness, unless they happen to be extremely smooth, with super-exponential Fourier coefficient decay. In contrast, a direct computation of the inner products ⟨f, ei⟩ yields exact recovery of any function in L²([0, 2π]) using an amount of computation equivalent to just one step of gradient descent. Thus, we see that gradient descent is an extremely inefficient way to recover the Fourier series of a general periodic function. The situation is only mildly improved in dimension d, where the span of at most O*((log t)^{d/2}) eigenfunctions of a Gaussian kernel, or O(t^{1/p}) eigenfunctions of an arbitrary p-differentiable kernel, can be approximated in t iterations. The discussion above shows that gradient descent with a smooth kernel can be viewed as a heavy regularization of the target function: it is essentially a band-limited approximation using no more than O(ln t) Fourier harmonics.
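The band-limiting effect can be reproduced in a few lines. The following diagonal sketch is our illustration, not the paper's code: it applies the per-harmonic contraction of Eq. 3 with the heat-kernel eigenvalues above and counts how many of the step function's odd harmonics are at least half recovered after t iterations:

```python
import numpy as np

# Heat-kernel eigenvalue for the j-th Fourier harmonic: exp(-ceil(j/2)^2 * s).
# The Heaviside step function has coefficients proportional to 1/j on odd j,
# but the count below depends only on the eigenvalues.
s = 1.0
j = np.arange(1, 40, 2)                        # odd harmonics 1, 3, 5, ...
lam = np.exp(-np.ceil(j / 2.0) ** 2 * s)

eta = 1.0                                      # step size 1/lambda_1; lambda_1 = 1 (constant harmonic)
recovered = []
for t in [10**2, 10**4, 10**8]:
    resid = (1.0 - eta * lam) ** t             # per-harmonic contraction factor (Eq. 3)
    recovered.append(int(np.sum(resid < 0.5)))

# Squaring the number of iterations buys roughly one more harmonic.
print(recovered)   # [2, 3, 4]
```

Going from 10² to 10⁸ iterations recovers only two additional harmonics, which is the "frozen" behavior described above.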
While regularization is often desirable from a generalization/finite-sample point of view, especially when the number of data points is small, the bias resulting from the application of the gradient descent algorithm cannot be overcome in a realistic number of iterations unless the target function is extremely smooth or the kernel itself is not infinitely differentiable.

Remark: Rate of convergence vs. statistical fit. Note that we can improve convergence by changing the shape parameter of the kernel, i.e., making it more "peaked" (e.g., decreasing the bandwidth s in the definition of the Gaussian kernel). While that does not change the exponential nature of the eigenvalue asymptotics, it slows their decay. Unfortunately, improved convergence comes at the price of overfitting. In particular, for finite data, using a very narrow Gaussian kernel results in an approximation to the 1-NN classifier, a suboptimal method which is asymptotically up to a factor of two inferior to the Bayes-optimal classifier in the binary classification case.

Finite sample effects, regularization and early stopping. It is well known (e.g., [B+05, RBV10]) that the top eigenvalues of kernel matrices approximate the eigenvalues of the underlying integral operators. Therefore the computational obstructions encountered in the infinite case persist whenever the data set is large enough. Note that for a kernel method, t iterations of gradient descent on n data points require t · n² operations. Thus, gradient descent is computationally pointless unless t ≪ n, which would allow us to fit only about O(log t) eigenvectors. In practice we need t to be much smaller than n, say t < 1000. At this point we should contrast our conclusions with the important analysis of early stopping for gradient descent provided in [YRC07] (see also [RWY14, CARR16]).
The authors analyze gradient descent for kernel methods, obtaining an optimal number of iterations of the form t = n^θ, θ ∈ (0, 1). That seems to contradict our conclusion that a very large, potentially exponential, number of iterations may be needed to guarantee convergence. The apparent contradiction stems from the assumption in [YRC07] that the regression function f∗ belongs to the range of some power of the kernel operator K. For an infinitely differentiable kernel, that assumption implies super-polynomial spectral decay (ai = O(λi^N) for any N > 0). In particular, it implies that f∗ belongs to every Sobolev space. We do not typically expect such a high degree of smoothness in practice, particularly in classification problems, where the Heaviside step function seems to be a reasonable model. In particular, we expect sharp transitions of label probabilities across class boundaries to be typical of many classification datasets. These areas of near-discontinuity necessarily result in slow decay of the Fourier coefficients and require many iterations of gradient descent to approximate¹.

To illustrate this point, we show (in the table) the results of gradient descent for two datasets of 10000 points (see Section 6). [Table: training and test L2 loss and test classification error for MNIST-10k and HINT-M-10k at 80 to 81920 iterations of gradient descent.] The regression error on the training set is roughly inverse to the number of iterations, i.e., every extra bit of precision requires twice the number of iterations of the previous bit. For comparison, we see that the minimum regression (L2) error on both test sets is achieved at over 10000 iterations. That entails at least cubic computational complexity, equivalent to that of a direct method.

Regularization. Note that typical regularization, e.g., adding λ‖f‖, results in discarding information along the directions with small eigenvalues (below λ). While this improves the condition number, it comes at a high cost in terms of over-regularization. In the Fourier analysis example, this is similar to considering band-limited functions with ∼ √(log(1/λ)/s) Fourier components. Even for λ = 10^{−16} (the limit of double precision) and s = 1, we can only fit about 10 Fourier components. We argue that there is little need for explicit regularization for most iterative methods in big data regimes.

3 Extending the reach of gradient descent: EigenPro iteration

We will now propose practical measures to alleviate the over-regularization of linear regression by gradient descent. As seen above, one of the key shortcomings of shallow learning methods based on smooth kernels (and their approximations, e.g., Fourier and RBF features) is their fast spectral decay. That suggests modifying the corresponding matrix H by decreasing its top eigenvalues, enabling the algorithm to approximate more target functions in the same number of iterations. Moreover, this can be done in a way compatible with stochastic gradient descent, thus obviating the need to materialize full covariance/kernel matrices in memory. An accurate approximation of the top eigenvectors can be obtained from a subsample of the data at modest computational expense. Combining these observations, we propose EigenPro, a low-overhead preconditioned Richardson iteration.

Preconditioned (stochastic) gradient descent. We will modify the linear system in Eq. 1 with an invertible matrix P, called a left preconditioner: P Hα − P b = 0.
Clearly, this modified system and the original system in Eq. 1 have the same solution. The Richardson iteration corresponding to the modified system (the preconditioned Richardson iteration) is

α^(t+1) = α^(t) − η P (Hα^(t) − b)    (5)

It is easy to see that as long as η‖P H‖ < 1, it converges to α∗, the solution of the original linear system. Preconditioned SGD can be defined similarly by

α ← α − η P (Hm α − bm)    (6)

where Hm = (1/m) Xm^T Xm and bm = (1/m) Xm^T ym are computed on a sampled mini-batch (Xm, ym).

Preconditioning as a linear feature map. It is easy to see that the preconditioned iteration is in fact equivalent to the standard Richardson iteration in Eq. 2 on a dataset transformed with the linear feature map φP(x) = P^{1/2} x. This is a convenient point of view, as the transformed data can be stored for future use. It also shows that preconditioning is compatible with most computational methods, both in practice and, potentially, in terms of analysis.

Linear EigenPro. We will now discuss the properties desired to make preconditioned GD/SGD methods effective on large-scale problems. For the modified iteration in Eq. 5 we would like to choose P to meet the following targets. (Acceleration) The algorithm should provide high accuracy in a small number of iterations. (Initial cost) The preconditioning matrix P should be accurately computable without materializing the full covariance matrix. (Cost per iteration) Preconditioning by P should be efficient per iteration in terms of computation and memory. The convergence of the preconditioned algorithm along the i-th eigendirection depends on the ratio of eigenvalues λi(P H)/λ1(P H). This leads us to choose the preconditioner P to maximize the ratio λi(P H)/λ1(P H) for each i.
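As a minimal numerical sketch (ours, not from the paper; NumPy, synthetic spectrum), we can contrast the plain Richardson iteration of Eq. 2 with the preconditioned iteration of Eq. 5, using the idealized preconditioner P = I − Σ_{i≤k} (1 − λ_{k+1}/λi) ei ei^T built from exact top-k eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ill-conditioned synthetic problem: H = Q diag(lam) Q^T with lam_i = i^-3.
d, k, t = 200, 20, 3000
lam = (1.0 + np.arange(d)) ** -3.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = (Q * lam) @ Q.T
alpha_star = rng.standard_normal(d)
b = H @ alpha_star

# Idealized preconditioner from exact top-k eigenvectors:
# P = I - sum_{i<=k} (1 - lam_{k+1}/lam_i) e_i e_i^T
E = Q[:, :k]
P = np.eye(d) - (E * (1.0 - lam[k] / lam[:k])) @ E.T

def richardson(P, eta):
    alpha = np.zeros(d)
    for _ in range(t):
        alpha -= eta * P @ (H @ alpha - b)     # Eq. 5 (Eq. 2 when P = I)
    return np.linalg.norm(alpha - alpha_star)

plain = richardson(np.eye(d), 1.0 / lam[0])    # step size 1/lam_1
precond = richardson(P, 1.0 / lam[k])          # larger step: top of P H is lam_{k+1}
print(plain, precond)
assert precond < 0.1 * plain
```

The point of the preconditioner is that the top eigenvalue of P H drops to λ_{k+1}, which licenses the much larger step size and hence the faster convergence observed here.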
We see that modifying the top eigenvalues of H makes the most difference in convergence. For example, decreasing λ1 improves convergence along all directions, while decreasing any other eigenvalue only speeds up convergence in that direction. However, decreasing λ1 below λ2 does not help unless λ2 is decreased as well. Therefore it is natural to decrease the top k eigenvalues as much as possible, i.e., to λk+1, leading to

P = I − Σ_{i=1}^k (1 − λk+1/λi) ei ei^T    (7)

We see that the P-preconditioned iteration increases convergence by a factor of up to λ1/λk+1. However, the exact construction of P involves computing the eigendecomposition of the d × d matrix H, which is not feasible for large data. Instead, we use subsampled randomized SVD [HMT11] to obtain an approximate preconditioner P̂τ = I − Σ_{i=1}^k (1 − τ λ̂k+1/λ̂i) êi êi^T. Here the algorithm RSVD (detailed in the full paper) computes the approximate top eigenvectors E ← (ê1, ..., êk) and eigenvalues Λ ← diag(λ̂1, ..., λ̂k), together with λ̂k+1, for the subsample covariance matrix HM.

Algorithm: EigenPro(X, y, k, m, η, τ, M)
input: training data (X, y), number of eigendirections k, mini-batch size m, step size η, damping factor τ, subsample size M
output: weight of the linear model α
1: [E, Λ, λ̂k+1] = RSVD(X, k + 1, M)
2: P = I − E(I − τ λ̂k+1 Λ^{-1}) E^T
3: Initialize α ← 0
4: while stopping criterion is False do
5:   (Xm, ym) ← m rows sampled from (X, y) without replacement
6:   g ← (1/m) (Xm^T (Xm α) − Xm^T ym)
7:   α ← α − η P g
8: end while

We introduce the parameter τ to counter the effect of the approximate top eigenvectors "spilling" into the span of the remaining eigensystem. Using τ < 1 is preferable to the obvious alternative of decreasing the step size η, as it does not decrease the step size in the directions nearly orthogonal to the span of (ê1, ..., êk), allowing the iteration to converge faster in those directions. In particular, when (ê1, ..., êk) are computed exactly, the step size in the other eigendirections is not affected by the choice of τ. We call SGD with the preconditioner P̂τ (Eq. 6) EigenPro iteration; see Algorithm EigenPro for details. Moreover, the key step size parameter η can be selected in a theoretically sound way, discussed below.

Kernel EigenPro. We will now discuss the modifications needed to work directly in the RKHS (primal) setting. A positive definite kernel k(·,·) : R^N × R^N → R implies a feature map from X into an RKHS H, which can be written as φ : x ↦ k(x, ·), R^N → H. This feature map leads to the learning problem f∗ = arg min_{f∈H} (1/n) Σ_{i=1}^n (⟨f, k(xi, ·)⟩_H − yi)². Using properties of the RKHS, the EigenPro iteration in H becomes f ← f − η P(K(f) − b), where b = (1/n) Σ_{i=1}^n yi k(xi, ·), the covariance operator is K = (1/n) Σ_{i=1}^n k(xi, ·) ⊗ k(xi, ·), and the top eigensystem of K forms the preconditioner P = I − Σ_{i=1}^k (1 − τ λk+1(K)/λi(K)) ei(K) ⊗ ei(K). By the Representer theorem [Aro50], f∗ admits a representation of the form Σ_{i=1}^n αi k(xi, ·).

¹ Interestingly, they can lead to lower sample complexity for optimal classifiers (cf. the Tsybakov margin condition [Tsy04]).
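Algorithm EigenPro condenses into a short script. The sketch below is our simplified reconstruction, not the reference implementation: it uses NumPy, synthetic data with a fast-decaying spectrum, and a plain eigendecomposition of a subsample covariance standing in for the RSVD call, then runs the preconditioned mini-batch SGD of Eq. 6:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem whose covariance has fast spectral decay.
n, d, k, m, tau, M = 5000, 100, 10, 50, 0.9, 1000
X = rng.standard_normal((n, d)) * (1.0 + np.arange(d)) ** -1.5
alpha_star = rng.standard_normal(d)
y = X @ alpha_star

# Approximate top eigensystem from a subsample covariance (eigh here is a
# stand-in for RSVD in line 1 of Algorithm EigenPro).
XM = X[rng.choice(n, M, replace=False)]
w, V = np.linalg.eigh(XM.T @ XM / M)
w, V = w[::-1], V[:, ::-1]                     # descending order
E, lam_hat, lam_k1 = V[:, :k], w[:k], w[k]

# Line 2: P = I - E (I - tau * lam_{k+1} Lambda^{-1}) E^T
P = np.eye(d) - (E * (1.0 - tau * lam_k1 / lam_hat)) @ E.T

eta = 0.2 / lam_k1                             # conservative fraction of 1/lam_{k+1}
alpha = np.zeros(d)
for _ in range(3000):                          # lines 4-8: preconditioned mini-batch SGD
    idx = rng.integers(0, n, m)
    Xm, ym = X[idx], y[idx]
    alpha -= eta * P @ (Xm.T @ (Xm @ alpha - ym) / m)

rel_loss = np.mean((X @ alpha - y) ** 2) / np.mean(y ** 2)
print(rel_loss)                                # far below 1: the fit succeeds
assert rel_loss < 0.05
```

The step size is deliberately kept well below λ̂k+1^{-1}; as Section 4 discusses, taking η = λ̂k+1^{-1} itself tends to diverge once eigenvector estimation error and SGD randomness enter.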
Parameterizing the above iteration accordingly and applying some linear algebra leads to the following iteration in a finite-dimensional vector space: α ← α − η P(Kα − y), where K := [k(x_i, x_j)]_{i,j=1,…,n} is the kernel matrix and the EigenPro preconditioner P is defined using the top eigensystem of K (assume K e_i = λ_i e_i):

P := I − Σ_{i=1}^{k} λ_i^{-1} (1 − τ λ_{k+1}/λ_i) e_i e_i^T.

This differs from the linear-case preconditioner (Eq. 7) by an extra factor of 1/λ_i, due to the difference between the parameter space of α and the RKHS H.

EigenPro as kernel learning. Another way to view EigenPro is in terms of kernel learning. Assuming that the preconditioner is computed exactly, EigenPro is equivalent to computing with the (distribution-dependent) kernel

k_EP(x, z) := Σ_{i=1}^{k} λ_{k+1} e_i(x) e_i(z) + Σ_{i=k+1}^{∞} λ_i e_i(x) e_i(z).

Notice that the RKHS spaces corresponding to k_EP and k contain the same functions but have different norms. The norm in k_EP is a finite-rank modification of the norm in the RKHS corresponding to k, a setting reminiscent of [SNB05], where unlabeled data was used to "warp" the norm for semi-supervised learning. In our paper, however, the "warping" is purely for computational efficiency.

Acceleration. EigenPro can obtain an acceleration factor of up to λ_1/λ_{k+1} over standard gradient descent. That factor assumes full gradient descent and exact computation of the preconditioner. See below for an acceleration analysis in the SGD setting.

Initial cost. To construct the preconditioner P, we perform RSVD to compute the approximate top eigensystem of the covariance H. RSVD has time complexity O(M d log k + (M + d)k²) (see [HMT11]). The subsample size M can be much smaller than the data size n while preserving the accuracy of the estimate. In addition, extra kd memory is needed to store the eigenvectors.

Cost per iteration. For standard SGD using d kernel centers (or random Fourier features) and a mini-batch of size m, the computational cost per iteration is O(md). In comparison, an EigenPro iteration using the top-k eigendirections costs O(md + kd). Specifically, applying the preconditioner P in EigenPro requires left multiplication by a matrix of rank k. This involves k vector-vector dot products, resulting in k·d additional operations per iteration. These can be implemented efficiently on a GPU.

4 Step Size Selection for EigenPro Preconditioned Methods

We now discuss the key issue of step size selection for EigenPro iteration. For an iteration involving the covariance matrix H, the step size λ_1(H)^{-1} = ‖H‖^{-1} results in optimal (within a factor of 2) convergence. This suggests choosing the corresponding step size η = ‖PH‖^{-1} = λ_{k+1}^{-1}. In practice this will lead to divergence due to (1) the approximate computation of eigenvectors and (2) the randomness inherent in SGD. One (costly) possibility is to compute ‖P H_m‖ at every step. As the mini-batch can be assumed to be chosen at random, we instead propose using a high-probability lower bound on ‖P H_m‖^{-1} as the step size to guarantee convergence at each iteration.

Linear EigenPro. Consider the EigenPro preconditioned SGD in Eq. 6. For this analysis, assume that P is formed from the exact eigenvectors. Interpreting P^{1/2} as a linear feature map as in Section 2 makes P^{1/2} H_m P^{1/2} the covariance of a random subsample of the dataset X P^{1/2}. Using matrix Bernstein [Tro15] yields:

Theorem 3. If ‖x‖² ≤ κ for any x ∈ X and λ_{k+1} = λ_{k+1}(H), then with probability at least 1 − δ,

‖P H_m‖ ≤ λ_{k+1} + 2(λ_{k+1} + κ)(3m)^{-1} ln(2d δ^{-1}) + sqrt(2 λ_{k+1} κ m^{-1} ln(2d δ^{-1})).

Kernel EigenPro. For EigenPro iteration in an RKHS, we can bound ‖P∘K_m‖ with a very similar result based on operator Bernstein [Min17]. Note that the dimension d in Theorem 3 is replaced by the intrinsic dimension [Tro15]. See the arXiv version of this paper for details.

Choice of the step size. In the spectral norm bounds, λ_{k+1} is the dominant term when the mini-batch size m is large. However, in most large-scale settings m is small, and sqrt(2 λ_{k+1} κ/m) becomes the dominant term. This suggests choosing a step size η ∼ 1/sqrt(λ_{k+1}), leading to acceleration on the order of λ_1/sqrt(λ_{k+1}) over standard (unpreconditioned) SGD. This choice works well in practice.

5 EigenPro and Related Work

Large-scale machine learning imposes fairly specific limitations on optimization methods. The computational budget allocated to the problem must not exceed O(n²) operations, a small number of matrix-vector multiplications. That rules out most direct second-order methods, which require O(n³) operations. Approximate second-order methods are far more efficient. However, they typically rely on low-rank matrix approximation, a strategy which (similarly to regularization), in conjunction with smooth kernels, discards information along important eigendirections with small eigenvalues. On the other hand, first-order methods can be slow to converge along eigenvectors with small eigenvalues. An effective method must thus be a hybrid approach, using approximate second-order information within a first-order method. EigenPro is an example of such an approach, as the second-order information is used in conjunction with a first-order method. The things that make EigenPro effective are as follows:
1.
The second-order information (eigenvalues and eigenvectors) is computed efficiently from a subsample of the data. Due to the quadratic loss function, that computation needs to be conducted only once. Moreover, the step size can be fixed throughout the iterations.
2. Preconditioning by a low-rank modification of the identity matrix results in low overhead per iteration. The update is computed without materializing the full preconditioned covariance matrix.
3. EigenPro iteration converges (mathematically) to the same result even if the second-order approximation is inaccurate. That makes EigenPro relatively robust to errors in the second-order preconditioning term P, in contrast to most approximate second-order methods.

Related work: first-order optimization methods. Gradient-based methods, such as gradient descent (GD) and stochastic gradient descent (SGD), are classical [She94, DJS96, BV04, Bis06]. The recent success of neural networks has drawn significant attention to improving and accelerating these methods. Methods like SAG [RSB12] and SVRG [JZ13] improve stochastic gradient by periodically evaluating the full gradient to achieve variance reduction. The algorithms in [DHS11, TH12, KB14] compute an adaptive step size for each gradient coordinate.

Scalable kernel methods. There is a significant literature on scalable kernel methods, including [KSW04, HCL+08, SSSSC11, TBRS13, DXH+14]. Most of these are first-order optimization methods. To avoid the O(n²) computation and memory requirements typically involved in constructing the kernel matrix, they often adopt approximations such as RBF features [WS01, QB16, TRVR16] or random Fourier features [RR07, LSS13, DXH+14, TRVR16].

Second-order/hybrid optimization methods.
Second-order methods use the inverse of the Hessian matrix or its approximation to accelerate convergence [SYG07, BBG09, MNJ16, BHNS16, ABH16]. These methods often need to compute the full gradient at every iteration [LN89, EM15, ABH16], making them less suitable for large data. [EM15] analyzed a hybrid first/second-order method for general convex optimization with a rescaling term based on the top eigenvectors of the Hessian; that can be viewed as preconditioning the Hessian at every GD iteration. A related recent work [GOSS16] analyzes a hybrid method designed to accelerate SGD convergence for ridge regression, in which the data are preprocessed by rescaling points along the top singular vectors of the data matrix. Another second-order method, PCG [ACW16], accelerates the convergence of conjugate gradient for large kernel ridge regression using a preconditioner which is the inverse of an approximate covariance generated with random Fourier features. [TRVR16] achieves a similar preconditioning effect by solving a linear system involving a subsampled kernel matrix at every iteration. While not strictly a preconditioning method, Nyström with gradient descent (NYTRO) [CARR16] also improves the condition number. Compared to many of these methods, EigenPro directly addresses the underlying issue of slow convergence without introducing a bias in directions with small eigenvalues. Additionally, EigenPro incurs only a small overhead per iteration, both in memory and in computation.

6 Experimental Results

Computing resources/data/metrics. Experiments were run on a workstation with 128GB main memory, two Intel Xeon E5-2620 CPUs, and one GTX Titan X (Maxwell) GPU. We report classification error (c-error) for datasets with discrete labels and mean squared error (mse) for datasets with real-valued labels. See the arXiv version for details and more experimental results.

Kernel methods/Hyperparameters.
For smaller datasets, the direct solution of kernel regularized least squares (KRLS) is used to obtain the reference error. We compare with the primal method Pegasos [SSSSC11]. For even larger datasets, we use random Fourier features [RR07] (RF) with SGD, as in [DXH+14, TRVR16]. The results of these methods are presented as baselines. For consistent comparison, all iterative methods use mini-batches of size m = 256. The EigenPro preconditioner is constructed using the top k = 160 eigenvectors of a subsampled dataset of size M = 4800. For EigenPro-RF we set the damping factor τ = 1/4; for primal EigenPro, τ = 1.

Acceleration for different kernels. The table below presents the number of epochs needed by EigenPro and Pegasos to reach the error of the optimal kernel classifier. We see that EigenPro provides an acceleration of 6 to 35 times in the number of epochs required, without any loss of accuracy. The actual acceleration is about 20% less due to the overhead of maintaining and applying the preconditioner.

Comparisons on large datasets. The table below compares EigenPro to Pegasos/SGD-RF on several large datasets after 10 epochs. We see that EigenPro consistently outperforms Pegasos/SGD-RF within a fixed computational budget.
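The fixed step size used with these mini-batches can be obtained from the spectral-norm bound of Theorem 3 (Section 4). A minimal sketch of that rule; the function name is ours, and it assumes estimates of λ_{k+1} and κ = max ‖x‖² are available:

```python
import math

def eigenpro_step_size(lam_k1, kappa, d, m, delta=0.01):
    """Step size eta = 1/bound, where bound is the Theorem 3
    upper bound on ||P H_m|| holding with probability >= 1 - delta.

    lam_k1: estimate of lambda_{k+1}; kappa: bound on ||x||^2;
    d: dimension; m: mini-batch size. For small m, the sqrt term
    dominates and eta scales like 1/sqrt(lambda_{k+1}).
    """
    log_term = math.log(2 * d / delta)
    bound = (lam_k1
             + 2 * (lam_k1 + kappa) * log_term / (3 * m)
             + math.sqrt(2 * lam_k1 * kappa * log_term / m))
    return 1.0 / bound
```

Because the bound exceeds λ_{k+1}, this step size is always more conservative than the idealized full-gradient choice η = 1/λ_{k+1}, and it grows toward that value as the mini-batch size m increases.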
Note that we adopt a Gaussian kernel and 2·10^5 random features.

[Table: number of epochs needed by EigenPro vs. Pegasos to reach the optimal classifier error with Gaussian, Laplace, and Cauchy kernels on MNIST (6·10^4), CIFAR-10 (5·10^4), SVHN (7·10^4), and HINT-S (5·10^4). EigenPro requires 4 to 19 epochs where Pegasos requires 54 to 308; the per-kernel column alignment was lost in extraction.]

Comparison of EigenPro and Pegasos/SGD-RF on large datasets (10 epochs); each method lists result and GPU hours:

Dataset  | Size   | Metric  | EigenPro       | Pegasos        | EigenPro-RF    | SGD-RF
HINT-S   | 2·10^5 | c-error | 10.0% / 0.1h   | 11.7% / 0.1h   | 10.3% / 0.2h   | 11.5% / 0.1h
TIMIT    | 1·10^6 | c-error | 31.7% / 3.2h   | 33.0% / 2.2h   | 32.6% / 1.5h   | 33.3% / 1.0h
MNIST-8M | 1·10^6 | c-error | 0.8% / 3.0h    | 1.1% / 2.7h    | 0.8% / 0.8h    | 1.0% / 0.7h
MNIST-8M | 8·10^6 | c-error | -              | -              | 0.7% / 7.2h    | 0.8% / 6.0h
HINT-M   | 1·10^6 | mse     | 2.3e-2 / 1.9h  | 2.7e-2 / 1.5h  | 2.4e-2 / 0.8h  | 2.7e-2 / 0.6h
HINT-M   | 7·10^6 | mse     | -              | -              | 2.1e-2 / 5.8h  | 2.4e-2 / 4.1h

Comparisons to state-of-the-art. In the table below, we provide a comparison to several large-scale kernel results reported in the literature. EigenPro improves or matches performance on each dataset at a much lower computational budget. We note that [MGL+17] achieves error 30.9% on TIMIT using an AWS cluster. That method uses a novel supervised feature selection method and is hence not directly comparable.
EigenPro can plausibly further improve the training error using this new feature set.

Dataset | Size     | EigenPro (1 GTX Titan X): error / epochs / GPU hours | Reported results: error / source / description
MNIST   | 1·10^6   | 0.70% / 16 / 4.8h       | 0.72% / [ACW16] / 1.1 hours, 189 epochs, 1344 AWS vCPUs
MNIST   | 6.7·10^6 | 0.80%† / 10 / 0.8h      | 0.85% / [LML+14] / less than 37.5 hours on 1 Tesla K20m
TIMIT   | 2·10^6   | 31.7% (32.5%)‡ / 10 / 3.2h | 33.5% / [HAS+14] / 512 IBM BlueGene/Q cores
        |          |                         | 33.5% / [TRVR16] / 7.5 hours on 1024 AWS vCPUs
SUSY    | 4·10^6   | 19.8% / 0.6 / 0.1h      | ≈20% / [CAS16] / 0.6 hours on IBM POWER8

† This result was produced by EigenPro-RF using 1×10^6 data points.
‡ Our TIMIT training set (1×10^6 data points) was generated following standard practice in the speech community [PGB+11] by taking 10ms frames and dropping the frames labeled with the glottal stop 'q' in the core test set (1.2% of the total test set). [HAS+14] adopts 5ms frames, resulting in 2×10^6 data points, and keeps the glottal stop 'q'. In the worst-case scenario for EigenPro, if we mislabel all glottal stops, the corresponding frame-level error increases from 31.7% to 32.5%.

Acknowledgements. We thank Adam Stiff, Eric Fosler-Lussier, Jitong Chen, and Deliang Wang for providing the TIMIT and HINT datasets. This work is supported by NSF IIS-1550757 and NSF CCF-1422830. Part of this work was completed while the second author was at the Simons Institute at Berkeley. In particular, he thanks Suvrit Sra, Daniel Hsu, Peter Bartlett, and Stefanie Jegelka for many discussions and helpful suggestions.

References

[ABH16] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization in linear time. arXiv preprint arXiv:1602.03943, 2016.

[ACW16] H. Avron, K. Clarkson, and D. Woodruff. Faster kernel ridge regression using sketching and preconditioning.
arXiv preprint arXiv:1611.03220, 2016.

[Aro50] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[B+05] Mikio Ludwig Braun et al. Spectral properties of the kernel matrix and their relation to kernel methods in machine learning. PhD thesis, University of Bonn, 2005.

[BBG09] Antoine Bordes, Léon Bottou, and Patrick Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR, 10:1737–1754, 2009.

[BHNS16] Richard H. Byrd, S. L. Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[CARR16] Raffaello Camoriano, Tomás Angles, Alessandro Rudi, and Lorenzo Rosasco. NYTRO: When subsampling meets early stopping. In AISTATS, pages 1403–1411, 2016.

[CAS16] Jie Chen, Haim Avron, and Vikas Sindhwani. Hierarchically compositional kernels for scalable nonparametric learning. arXiv preprint arXiv:1608.00860, 2016.

[CK11] Chih-Chieh Cheng and Brian Kingsbury. Arccosine kernels: Acoustic modeling with infinite neural networks. In ICASSP, pages 5200–5203. IEEE, 2011.

[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.

[DJS96] John E. Dennis Jr and Robert B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, 1996.

[DXH+14] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients. In NIPS, pages 3041–3049, 2014.

[EM15] M. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods.
In NIPS, 2015.

[GOSS16] Alon Gonen, Francesco Orabona, and Shai Shalev-Shwartz. Solving ridge regression using sketched preconditioned SVRG. In ICML, pages 1397–1405, 2016.

[HAS+14] Po-Sen Huang, Haim Avron, Tara N. Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In ICASSP, pages 205–209. IEEE, 2014.

[HCL+08] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408–415. ACM, 2008.

[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.

[KB14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KSW04] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

[LML+14] Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, et al. How to scale up kernel methods to be as good as deep neural nets. arXiv preprint arXiv:1411.4000, 2014.

[LN89] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

[LSS13] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: Approximating kernel expansions in loglinear time.
In Proceedings of the International Conference on Machine Learning, 2013.

[MGL+17] Avner May, Alireza Bagheri Garakani, Zhiyun Lu, Dong Guo, Kuan Liu, Aurélien Bellet, Linxi Fan, Michael Collins, Daniel Hsu, Brian Kingsbury, et al. Kernel approximation methods for speech recognition. arXiv preprint arXiv:1701.03577, 2017.

[Min17] Stanislav Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. Statistics & Probability Letters, 2017.

[MNJ16] P. Moritz, R. Nishihara, and M. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In AISTATS, 2016.

[PGB+11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. The Kaldi speech recognition toolkit. In ASRU, 2011.

[QB16] Qichao Que and Mikhail Belkin. Back to the future: Radial basis function networks revisited. In AISTATS, pages 1375–1383, 2016.

[RBV10] Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11(Feb):905–934, 2010.

[Ric11] Lewis Fry Richardson. The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam. Philosophical Transactions of the Royal Society of London, Series A, 210:307–357, 1911.

[Ros97] Steven Rosenberg. The Laplacian on a Riemannian Manifold: An Introduction to Analysis on Manifolds. Number 31. Cambridge University Press, 1997.

[RR07] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.

[RSB12] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

[RWY14] G. Raskutti, M. Wainwright, and B.
Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. JMLR, 15(1):335–366, 2014.

[SC08] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

[She94] Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.

[SNB05] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 824–831. ACM, 2005.

[SSSSC11] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[STC04] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[SYG07] Nicol N. Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-Newton method for online convex optimization. In AISTATS, pages 436–443, 2007.

[TBRS13] Martin Takác, Avleen Singh Bijral, Peter Richtárik, and Nati Srebro. Mini-batch primal and dual methods for SVMs. In ICML (3), pages 1022–1030, 2013.

[TH12] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.

[Tro15] Joel A. Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

[TRVR16] S. Tu, R. Roelofs, S. Venkataraman, and B. Recht. Large scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310, 2016.

[Tsy04] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135–166, 2004.

[WS01] Christopher Williams and Matthias Seeger.
Using the Nystr\u00f6m method to speed up kernel machines.\n\nIn NIPS, pages 682\u2013688, 2001.\n\n[YRC07] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent\n\nlearning. Constructive Approximation, 26(2):289\u2013315, 2007.\n\n10\n\n\f", "award": [], "sourceid": 2089, "authors": [{"given_name": "SIYUAN", "family_name": "MA", "institution": "The Ohio State University"}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": "Ohio State University"}]}