{"title": "Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 7636, "page_last": 7646, "abstract": "In this paper, we study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme, based on a sequence of problems with decreasing regularization parameters, is globally convergent, and that this convergence is linear with a constant factor which scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling as regular first-order methods but with an improved behavior, in particular on ill-conditioned problems. Second, in the non-parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n df), a memory complexity of order O(df^2) and no dependence on the condition number, generalizing the results known for least-squares regression. Here n is the number of observations and df is the associated degrees of freedom. 
In particular, this is the first large-scale algorithm to solve logistic and softmax regressions in the non-parametric setting with large condition numbers and theoretical guarantees.", "full_text": "Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses\n\nUlysse Marteau-Ferey\n\nINRIA - École Normale Supérieure\n\nPSL Research University\n\nulysse.marteau-ferey@inria.fr\n\nFrancis Bach\n\nINRIA - École Normale Supérieure\n\nPSL Research University\n\nfrancis.bach@inria.fr\n\nAlessandro Rudi\n\nINRIA - École Normale Supérieure\n\nPSL Research University\n\nalessandro.rudi@inria.fr\n\nAbstract\n\nIn this paper, we study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme, based on a sequence of problems with decreasing regularization parameters, is globally convergent, and that this convergence is linear with a constant factor which scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling as regular first-order methods but with an improved behavior, in particular on ill-conditioned problems. Second, in the non-parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n df_λ), a memory complexity of order O(df_λ²) and no dependence on the condition number, generalizing the results known for least-squares regression. Here n is the number of observations and df_λ is the associated degrees of freedom. 
In particular, this is the first large-scale algorithm to solve logistic and softmax regressions in the non-parametric setting with large condition numbers and theoretical guarantees.\n\n1 Introduction\n\nMinimization algorithms constitute a crucial algorithmic part of many machine learning methods, with algorithms available for a variety of situations [10]. In this paper, we focus on finite-sum problems of the form\n\nmin_{x ∈ H} f_λ(x) = f(x) + (λ/2)‖x‖², with f(x) = (1/n) Σ_{i=1}^n f_i(x),\n\nwhere H is a Euclidean or a Hilbert space, and each function f_i is convex and smooth. The running time of minimization algorithms classically depends on the number of functions n, the explicit (for Euclidean spaces) or implicit (for Hilbert spaces) dimension d of the search space, and the condition number of the problem, which is upper bounded by κ = L/λ, where L characterizes the smoothness of the functions f_i, and λ the regularization parameter.\n\nIn the last few years, there has been a strong focus on problems with large n and d, leading to first-order (i.e., gradient-based) stochastic algorithms, culminating in a sequence of linearly convergent algorithms whose running time is favorable in n and d, but scales at best in √κ [15, 22, 14, 4]. However, modern problems lead to objective functions with very large condition numbers: in many learning problems, the regularization parameter that is optimal for test predictive performance may be so small that the scaling above in √κ is not practical anymore (see examples in Sect. 5).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThese ill-conditioned problems are good candidates for second-order methods (i.e., methods that use the Hessians of the objective functions) such as the Newton method. 
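As a concrete instance of the finite-sum objective above, the following minimal sketch (our own illustration, with synthetic data and logistic losses f_i(x) = log(1 + exp(−y_i z_i⊤x)); all names are ours) evaluates the regularized objective f_λ:

```python
import numpy as np

def f_lambda(x, Z, y, lam):
    """Regularized finite-sum objective f_lambda(x) = f(x) + (lam/2)||x||^2
    with logistic losses f_i(x) = log(1 + exp(-y_i z_i^T x))."""
    margins = y * (Z @ x)                      # y_i z_i^T x, shape (n,)
    f = np.mean(np.logaddexp(0.0, -margins))   # (1/n) sum_i log(1 + e^{-m_i})
    return f + 0.5 * lam * np.dot(x, x)

# tiny synthetic problem (our own choice of data)
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3))
y = np.sign(Z @ np.array([1.0, -2.0, 0.5]))
lam = 1e-3
val = f_lambda(np.zeros(3), Z, y, lam)
```

At x = 0 every logistic loss equals log 2, which gives a quick sanity check on the implementation.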
These methods are traditionally discarded within machine learning for several reasons: (1) they are usually adapted to high-precision results, which are not necessary for generalization to unseen data in machine learning problems [9]; (2) computing the Newton step Δ_λ(x) = ∇²f_λ(x)⁻¹∇f_λ(x) requires forming the Hessian and solving the associated linear system, leading to a complexity which is at least quadratic in d, and thus prohibitive for large d; and (3) the global convergence properties are not applicable unless the function is very special, i.e., self-concordant [24] (which includes only a few classical learning problems), so these methods are often only shown to converge in a small area around the optimum.\n\nIn this paper, we argue that the three reasons above for not using the Newton method can be circumvented to obtain competitive algorithms: (1) high absolute precisions are indeed not needed for machine learning, but faced with strongly ill-conditioned problems, even a low-precision solution requires second-order schemes; (2) many approximate Newton steps have been designed for approximating the solution of the associated large linear system [1, 27, 25, 8]; (3) we propose a novel second-order method which is globally convergent and which is based on performing approximate Newton methods for a certain class of so-called generalized self-concordant functions, which includes logistic regression [6]. For these functions, the conditioning of the problem is also characterized by a more local quantity: κ_ℓ = R²/λ, where R characterizes the local evolution of Hessians. This leads to second-order algorithms which are competitive with first-order algorithms for well-conditioned problems, while being superior for ill-conditioned problems, which are common in practice.\n\nContributions. 
We make the following contributions:\n\n(a) We build a global second-order method for the minimization of f_λ, which relies only on computing approximate Newton steps of the functions f_μ, μ ≥ λ. The number of such steps will be of order O(c log κ_ℓ + log(1/ε)), where ε is the desired precision and c is an explicit constant. In the parametric setting (H = R^d), c can be as bad as √κ_ℓ in the worst case, but is much smaller in theory and practice. Moreover, in the non-parametric/kernel machine learning setting (H infinite-dimensional), c does not depend on the local condition number κ_ℓ.\n\n(b) Together with the appropriate quadratic solver to compute approximate Newton steps, we obtain an algorithm with the same scaling as regular first-order methods but with an improved behavior, in particular on ill-conditioned problems. Indeed, this algorithm matches the performance of the best quadratic solvers but covers any generalized self-concordant function, up to logarithmic terms.\n\n(c) In the non-parametric/kernel machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques. We prove that it achieves optimal generalization bounds with O(n df_λ) time and O(df_λ²) memory, where n is the number of observations and df_λ is the associated degrees of freedom. In particular, this is the first large-scale algorithm to solve logistic and softmax regression in the non-parametric setting with large condition numbers and theoretical guarantees.\n\n1.1 Comparison to related work\n\nWe consider two cases for H and the functions f_i that are common in machine learning: H = R^d with linear (in the parameter) models and explicit feature maps, and H infinite-dimensional, corresponding in machine learning to learning with kernels [32]. 
Moreover, in this section we first consider the quadratic case, for example the squared loss in machine learning (i.e., f_i(x) = (1/2)(x⊤z_i − y_i)² for some z_i ∈ H, y_i ∈ R). We first need to introduce the Hessian of the problem: for any λ > 0, define\n\nH(x) := ∇²f(x),   H_λ(x) := ∇²f_λ(x) = H(x) + λI;\n\nin particular we denote by H (and analogously H_λ) the Hessian at the optimum (which in the case of the squared loss corresponds to the covariance matrix of the inputs).\n\nQuadratic problems and H = R^d (ridge regression). The problem then consists in solving an (ill-conditioned) positive semi-definite symmetric linear system of dimension d × d. Methods based on randomized linear algebra, sketching and suitable subsampling [17, 18, 11] are able to find the solution with precision ε in time that is O((nd + min(n, d)³) log(L/(λε))), so essentially independently of the condition number, because of the logarithmic complexity in λ.\n\nQuadratic problems and H infinite-dimensional (kernel ridge regression). Here the problem corresponds to solving an (ill-conditioned) infinite-dimensional linear system in a reproducing kernel Hilbert space [32]. Since, however, the sum defining f is finite, the problem can be projected on a subspace of dimension at most n [5], leading to a linear system of dimension n × n. Solving it with the techniques above would lead to a complexity of the order O(n²), which is not feasible for massive learning problems (e.g., n ≈ 10⁷). Interestingly, these problems are usually approximately low-rank, with the rank represented by the so-called effective dimension df_λ [13], counting essentially the eigenvalues of the problem larger than λ:\n\ndf_λ = Tr(H H_λ⁻¹).   (1)\n\nNote that df_λ is bounded by min{n, L/λ} and in many cases df_λ ≪ min(n, L/λ). 
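The effective dimension of Eq. (1) is easy to compute from an eigendecomposition; the sketch below (our own illustration, with a synthetic spectrum) shows how fast eigenvalue decay makes df_λ much smaller than min(n, L/λ):

```python
import numpy as np

def effective_dimension(H, lam):
    """df_lambda = Tr(H (H + lam I)^{-1}), Eq. (1): a smooth count of the
    eigenvalues of H larger than lam."""
    eigs = np.linalg.eigvalsh(H)               # H symmetric PSD
    return float(np.sum(eigs / (eigs + lam)))

# geometric eigenvalue decay: 1, 1/2, 1/4, ... (a synthetic example)
eigs = 2.0 ** -np.arange(20)
H = np.diag(eigs)
df = effective_dimension(H, lam=1e-3)          # roughly the count of eigenvalues > 1e-3
```

Here L/λ = 1000 while df_λ ≈ 10, matching the remark that df_λ is often much smaller than min(n, L/λ).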
Using suitable projection techniques, like Nyström [34] or random features [26], it is possible to further reduce the problem to dimension df_λ, for a total cost to find the solution of O(n df_λ²). Finally, recent methods [29], combining suitable projection methods with refined preconditioning techniques, are able to find the solution with precision compatible with the optimal statistical learning error [13] in time that is O(n df_λ log(L/λ)), so being essentially independent of the condition number of the problem.\n\nConvex problems and explicit features (logistic regression). When the loss function is self-concordant, it is possible to leverage the fast techniques for linear systems in approximate Newton algorithms [25] (see more in Sec. 2), to achieve the solution in essentially O(nd + min(n, d)³) time, modulo logarithmic terms. However, only a few loss functions of interest are self-concordant; in particular, the widely used logistic and softmax losses are not self-concordant, but generalized self-concordant [6]. In such cases we need to use (accelerated/stochastic) first-order optimization methods to enter the quadratic convergence region of Newton methods [2], which leads to a solution in O(dn + d√(nL/λ) + min(n, d)³) time, and does not present any improvement over a simple accelerated first-order method. Globally convergent second-order methods have also been proposed to solve such problems [21], but the number of Newton steps needed being bounded only by L/λ, they lead to a solution in O(L/λ (nd + min(n, d)³)) time. With λ that can be as small as 10⁻¹² in modern machine learning problems, this makes both kinds of approaches expensive from a computational viewpoint for ill-conditioned problems. 
For such problems, with our new global second-order scheme, the algorithm we propose achieves instead a complexity of essentially O((nd + min(n, d)³) log(R²/(λε))) (see Thm. 1).\n\nConvex problems and H infinite-dimensional (kernel logistic regression). Analogously to the case above, it is not possible to use Newton methods profitably as global optimizers on losses that are not self-concordant, as we see in Sec. 3. In such cases, by combining the projection techniques developed in Sec. 4 with accelerated first-order optimization methods, it is possible to find a solution in O(n df_λ + df_λ √(nL/λ)) time. This can still be prohibitive in the very-small-regularization scenario, since it strongly depends on the condition number L/λ. In Sec. 4 we suitably combine our optimization algorithm with projection techniques, achieving optimal statistical learning error [23] in essentially O(n df_λ log(R²/λ)).\n\nFirst-order algorithms for finite sums. In dimension d, accelerated algorithms for strongly convex smooth (not necessarily self-concordant) finite sums, such as K-SVRG [4], have a running time proportional to O((n + √(nL/λ))d). This can be improved with preconditioning to O((n + √(dL/λ))d) for large n [2]. Quasi-Newton methods can also be used [20], but typically without the guarantees that we provide in this paper (which are logarithmic in the condition number in natural scenarios).\n\n2 Background: Newton methods and generalized self-concordance\n\nIn this section we start by recalling the definition of generalized self-concordant functions and motivate it with examples. We then recall basic facts about Newton and approximate Newton methods, and present existing techniques to efficiently compute approximate Newton steps. 
We start by introducing the definition of generalized self-concordance, which here is an extension of the one in [6].\n\nDefinition 1 (generalized self-concordant (GSC) function). Let H be a Hilbert space. We say that f is a generalized self-concordant function on G ⊂ H when G is a bounded subset of H and f is a convex and three times differentiable mapping on H such that\n\n∀x ∈ H, ∀h, k ∈ H,  ∇(3)f(x)[h, k, k] ≤ sup_{g ∈ G} |g · h| ∇²f(x)[k, k].\n\nWe will usually denote by R the quantity sup_{g ∈ G} ‖g‖ < ∞ and often omit G when it is clear from the context (for simplicity, think of G as the ball in H centered at zero with radius R > 0; then sup_{g ∈ G} |g · h| = R‖h‖). The globally convergent second-order scheme we present in Sec. 3 is specific to losses which satisfy this generalized self-concordance property. The following loss functions, which are widely used in machine learning, are generalized self-concordant, and motivate this work.\n\nExample 1 (Application to finite-sum minimization). The following loss functions are generalized self-concordant, but not self-concordant:\n\n(a) Logistic regression: f_i(x) = log(1 + exp(−y_i w_i⊤x)), where x, w_i ∈ R^d and y_i ∈ {−1, 1}.\n\n(b) Softmax regression: f_i(x) = log(Σ_{j=1}^k exp(x_j⊤w_i)) − x_{y_i}⊤w_i, where now x ∈ R^{d×k}, y_i ∈ {1, . . . , k} and x_j denotes the j-th column of x.\n\n(c) Generalized linear models with bounded features (see details in [7, Sec. 2.1]), which include conditional random fields [33].\n\n(d) Robust regression: f_i(x) = φ(y_i − w_i⊤x) with φ(u) = log(e^u + e^{−u}).\n\nNote that these losses are not self-concordant in the sense of [25]. 
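For intuition on why the logistic loss fits Definition 1: writing f_i(x) = φ(y_i w_i⊤x) with φ(t) = log(1 + e^{−t}), the scalar bound |φ′′′(t)| ≤ φ′′(t) gives ∇(3)f_i(x)[h, k, k] ≤ |w_i · h| ∇²f_i(x)[k, k]. A quick numerical check of the scalar bound (our own illustration, using the closed forms φ′′ = σ(1−σ) and φ′′′ = σ(1−σ)(1−2σ) with σ the sigmoid):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def phi_d2(t):
    """Second derivative of the logistic loss phi(t) = log(1 + e^{-t})."""
    s = sigmoid(t)
    return s * (1.0 - s)

def phi_d3(t):
    """Third derivative of phi, in closed form."""
    s = sigmoid(t)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

# |phi'''(t)| <= phi''(t) on a grid: the scalar inequality behind GSC
ts = np.linspace(-10.0, 10.0, 2001)
ok = bool(np.all(np.abs(phi_d3(ts)) <= phi_d2(ts) + 1e-15))
```

The inequality holds exactly since |1 − 2σ(t)| ≤ 1, whereas the self-concordance bound |φ′′′| ≤ 2(φ′′)^{3/2} fails for large |t| (φ′′ decays exponentially).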
Moreover, even if the losses f_i are self-concordant, the objective function f is not necessarily self-concordant, making any attempt to prove the self-concordance of the objective function f almost impossible.\n\nNewton method (NM). Given x_0 ∈ H, the Newton method consists in iterating the following update:\n\nx_{t+1} = x_t − Δ_λ(x_t),   Δ_λ(x_t) := H_λ⁻¹(x_t)∇f_λ(x_t).   (2)\n\nThe quantity Δ_λ(x) := H_λ⁻¹(x)∇f_λ(x) is called the Newton step at point x, and x − Δ_λ(x) is the minimizer of the second-order approximation of f_λ around x. Newton methods enjoy the following key property: if x_0 is close enough to the optimum, the convergence to the optimum is quadratic, and the number of iterations required to reach a given precision is independent of the condition number of the problem [12].\n\nHowever, Newton methods have three main limitations: (a) the region of quadratic convergence can be quite small, and reaching it can be computationally expensive, since this is usually done via first-order methods [2] whose linear convergence depends on the condition number of the problem; (b) the cost of computing the Hessian can be really expensive when n, d are large; and (c) the cost of computing Δ_λ(x_t) can be really prohibitive. In the rest of the section we recall some ways to deal with (b) and (c). Our main result of Sec. 3 is to provide a globalization scheme for the Newton method to tackle problem (a), easily integrable with approximate techniques dealing with (b) and (c), to make second-order techniques competitive.\n\nApproximate Newton methods (ANM) and approximate solutions to linear systems. Computing exactly the Newton increment Δ_λ(x_t), which corresponds essentially to the solution of a linear system, can be too expensive when n, d are large. 
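The exact update of Eq. (2) is short to state in code; below is a minimal sketch (our own implementation, not the paper's algorithm) for ℓ₂-regularized logistic regression, where a handful of iterations already reach machine precision on a small synthetic problem:

```python
import numpy as np

def newton_step(x, Z, y, lam):
    """One exact Newton update x <- x - H_lam(x)^{-1} grad f_lam(x) for
    f(x) = (1/n) sum_i log(1 + exp(-y_i z_i^T x)) (logistic losses)."""
    n, d = Z.shape
    m = y * (Z @ x)                            # margins y_i z_i^T x
    s = 1.0 / (1.0 + np.exp(m))                # sigma(-m_i)
    grad = -(Z.T @ (y * s)) / n + lam * x
    w = s * (1.0 - s)                          # phi''(m_i)
    H = (Z.T * w) @ Z / n + lam * np.eye(d)    # Hessian H_lam(x)
    return x - np.linalg.solve(H, grad)

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 3))
y = np.sign(Z @ np.array([1.0, -1.0, 0.5]))
x = np.zeros(3)
for _ in range(20):                            # quadratic convergence near x*
    x = newton_step(x, Z, y, lam=0.1)
```

Forming H and calling `solve` is exactly the O(nd² + d³) per-step cost that limitations (b) and (c) refer to.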
A natural idea is to approximate the Newton iteration, leading to approximate Newton methods:\n\nx_{t+1} = x_t − Δ̃_λ(x_t),   Δ̃_λ(x_t) ≈ Δ_λ(x_t).   (3)\n\nIn this paper, more generally, we consider any technique to compute Δ̃_λ(x_t) that provides a relative approximation [16] of Δ_λ(x_t), defined as follows.\n\nDefinition 2 (relative approximation). Let ρ < 1, let A be an invertible positive definite Hermitian operator on H and b ∈ H. We denote by LinApprox(A, b, ρ) the set of all ρ-relative approximations of z* = A⁻¹b, i.e., LinApprox(A, b, ρ) = {z ∈ H | ‖z − z*‖_A ≤ ρ‖z*‖_A}.\n\nSketching and subsampling for approximate Newton methods. Many techniques for approximating linear systems have been used to compute Δ̃_λ, in particular sketching of the Hessian matrix via fast transforms, and subsampling (see [25, 8, 2] and references therein). Assuming for simplicity that f_i = ℓ_i(w_i⊤x), with ℓ_i : R → R and w_i ∈ H, it holds:\n\nH(x) = (1/n) Σ_{i=1}^n ℓ_i^(2)(w_i⊤x) w_i w_i⊤ = V_x⊤V_x,   (4)\n\nwith V_x = D_x W ∈ R^{n×d}, where D_x ∈ R^{n×n} is a diagonal matrix defined as (D_x)_{ii} = (ℓ_i^(2)(w_i⊤x))^{1/2} and W ∈ R^{n×d} is defined as W = (w_1, . . . , w_n)⊤.\n\nBoth sketching and subsampling methods approximate z* = H_λ(x)⁻¹∇f_λ(x) with z̃ = H̃_λ(x)⁻¹∇f_λ(x); in particular, in the case of subsampling, H̃(x) = Σ_{j=1}^Q p_j w_{i_j} w_{i_j}⊤, where Q ≪ min(n, d), (p_j)_{j=1}^Q are suitable weights and (i_j)_{j=1}^Q are indices selected at random from {1, . . . , n} with suitable probabilities. 
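A minimal sketch of the subsampling idea (our own illustration: exact gradient, Hessian estimated from Q uniformly sampled rows, all constants illustrative):

```python
import numpy as np

def subsampled_newton_step(x, Z, y, lam, Q, rng):
    """One approximate Newton step for regularized logistic regression:
    exact gradient, Hessian estimated from Q uniformly subsampled rows
    (the subsampling scheme of Eq. (4) with uniform weights)."""
    n, d = Z.shape
    m = y * (Z @ x)
    s = 1.0 / (1.0 + np.exp(m))
    grad = -(Z.T @ (y * s)) / n + lam * x      # full gradient
    idx = rng.choice(n, size=Q, replace=False) # uniform subsample of rows
    w = s[idx] * (1.0 - s[idx])                # ell_i^{(2)} on the subsample
    H_tilde = (Z[idx].T * w) @ Z[idx] / Q + lam * np.eye(d)
    return x - np.linalg.solve(H_tilde, grad)

rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 3))
y = np.sign(Z @ np.array([1.0, -1.0, 0.5]))
x = np.zeros(3)
for _ in range(30):                            # linear convergence, cf. Lemma 2
    x = subsampled_newton_step(x, Z, y, lam=0.1, Q=50, rng=rng)
```

Since only the approximate step direction changes, the iterate still converges to the exact minimizer: the relative error of the step shrinks together with the gradient.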
Sketching methods instead use H̃(x) = Ṽ_x⊤Ṽ_x, with Ṽ_x = ΩV_x, where Ω ∈ R^{Q×n} is a structured matrix such that computing Ṽ_x has a cost in the order of O(nd log n); to this end, Ω is usually based on fast Fourier or Hadamard transforms [25]. Note that essentially all the techniques used in approximate Newton methods guarantee relative approximation. In particular, the following results can be found in the literature (see Lemmas 28 and 29 in Appendix I and [25], Lemma 2 for more details).\n\nLemma 1. Let x, b ∈ H and assume that ℓ_i^(2) ≤ a for some a > 0. With probability 1 − δ, the following methods output an element of LinApprox(H_λ(x), b, ρ), in O(Q²d + Q³ + c) time and O(Q² + d) space:\n\n(a) Subsampling with uniform sampling (see [27, 28]), where Q = O(ρ⁻² a/λ log(1/(λδ))) and c = O(1).\n\n(b) Subsampling with approximate leverage scores [27, 3, 28], where Q = O(ρ⁻² d̄f_λ log(1/(λδ))), c = O(min(n, a/λ) d̄f_λ²) and d̄f_λ = Tr(W⊤W(W⊤W + λ/a I)⁻¹) [30]. Note that d̄f_λ ≤ min(n, d).\n\n(c) Sketching with the fast Hadamard transform [25], where Q = O(ρ⁻² d̄f_λ log(a/(λδ))) and c = O(nd log n).\n\n3 Globally convergent scheme for ANM algorithms on GSC functions\n\nThe algorithm is based on the observation that when f_λ is generalized self-concordant, there exists a region where t steps of ANM converge as fast as 2⁻ᵗ. Our idea is to start from a very large regularization parameter λ_0, such that we are sure that x_0 is in the convergence region, perform some steps of ANM so that the solution enters the convergence region of f_{λ_1}, with λ_1 = qλ_0 and q < 1, and iterate this procedure until we enter the convergence region of f_λ. 
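This decreasing-regularization idea fits in a few lines of code; the sketch below is our own simplified rendering (exact Newton steps stand in for ANM_ρ, and the constants μ₀, q, t, T are illustrative rather than the ones prescribed later in Thm. 1):

```python
import numpy as np

def grad_hess(x, Z, y, mu):
    """Gradient and Hessian of f_mu(x) = (1/n) sum_i log(1+e^{-y_i z_i^T x}) + (mu/2)||x||^2."""
    n, d = Z.shape
    m = y * (Z @ x)
    s = 1.0 / (1.0 + np.exp(m))
    g = -(Z.T @ (y * s)) / n + mu * x
    w = s * (1.0 - s)
    H = (Z.T * w) @ Z / n + mu * np.eye(d)
    return g, H

def newton_steps(x, Z, y, mu, t):
    """t exact Newton steps on f_mu (standing in for ANM_rho(f_mu, x, t))."""
    for _ in range(t):
        g, H = grad_hess(x, Z, y, mu)
        x = x - np.linalg.solve(H, g)
    return x

def globalization_scheme(Z, y, lam, mu0=10.0, q=0.5, t=2, T=10):
    """Phase I: a few Newton steps on f_mu while mu shrinks geometrically
    toward lam; Phase II: finish with T Newton steps on f_lam itself."""
    x, mu = np.zeros(Z.shape[1]), mu0
    while mu > lam:                            # Phase I
        x = newton_steps(x, Z, y, mu, t)
        mu = q * mu
    return newton_steps(x, Z, y, lam, T)       # Phase II

rng = np.random.default_rng(0)
Z = rng.standard_normal((300, 3))
y = np.sign(Z @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(300))
x_hat = globalization_scheme(Z, y, lam=1e-4)   # ill-conditioned target problem
```

Each level μ only needs a constant number of steps because the previous iterate is already inside the good region of f_μ, which is the loop invariant established below.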
First we define the region of interest and characterize the behavior of NM and ANM in it; then we analyze the globalization scheme.\n\nPreliminary results: the Dikin ellipsoid. We consider the following region, which we prove to be contained in the region of quadratic convergence for the Newton method and which will be useful to build the globalization scheme. Let c, R > 0 and let f_λ be generalized self-concordant with coefficient R. We call Dikin ellipsoid, and denote by D_λ(c), the region\n\nD_λ(c) := {x | ν_λ(x) ≤ c√λ/R},   with ν_λ(x) := ‖∇f_λ(x)‖_{H_λ^{-1}(x)},\n\nwhere ν_λ(x) is usually called the Newton decrement and ‖x‖_A stands for ‖A^{1/2}x‖.\n\nLemma 2. Let λ > 0, c ≤ 1/7, let f_λ be generalized self-concordant and x ∈ D_λ(c). Then it holds: (1/4)ν_λ(x)² ≤ f_λ(x) − f_λ(x*_λ) ≤ ν_λ(x)². Moreover, the Newton method starting from x_0 has quadratic convergence, i.e., if x_t is obtained via t ∈ N steps of the Newton method in Eq. (2), then ν_λ(x_t) ≤ 2^{−(2^t−1)}ν_λ(x_0). Finally, approximate Newton methods starting from x_0 have a linear convergence rate, i.e., if x_t is given by Eq. (3), with Δ̃_t ∈ LinApprox(H_λ(x_t), ∇f_λ(x_t), ρ) and ρ ≤ 1/7, then ν_λ(x_t) ≤ 2^{−t}ν_λ(x_0).\n\nThis result is proved in Lemma 11 in Appendix B.3. The crucial aspect of the result above is that when x_0 ∈ D_λ(c), the convergence of the approximate Newton method is linear and does not depend on the condition number of the problem. However, D_λ(c) itself can be very small, since it depends on √λ/R. In the next subsection we see how to enter D_λ(c) in an efficient way.\n\nEntering the Dikin ellipsoid using a second-order scheme. 
The lemma above shows that D_λ(c) is a good region in which to use the approximate Newton algorithm on GSC functions. However, the region itself is quite small, since it depends on √λ/R. Some other globalization schemes arrive at regions of interest via first-order methods or back-tracking schemes [2, 1]. However, such approaches require a number of steps that is usually proportional to √(L/λ), making them non-beneficial in machine learning contexts. Here instead we consider the following simple scheme, where ANM_ρ(f_λ, x, t) denotes the result of a ρ-relative approximate Newton method performing t steps of optimization starting from x. The main ingredient guaranteeing that the scheme works is the following lemma (see Lemma 13 in Appendix C.1 for a proof).\n\nLemma 3. Let μ > 0, c < 1 and x ∈ H. Let s = 1 + R‖x‖/c; then for q ∈ [1 − 2/(3s), 1),\n\nD_μ(c/3) ⊆ D_{qμ}(c).\n\nNow we are ready to show that we can guarantee the loop invariant x_k ∈ D_{μ_k}(c). Indeed, assume that x_{k−1} ∈ D_{μ_{k−1}}(c). Then ν_{μ_{k−1}}(x_{k−1}) ≤ c√μ_{k−1}/R. By taking t = 2, ρ = 1/7, and performing x_k = ANM_ρ(f_{μ_{k−1}}, x_{k−1}, t), by Lemma 2, ν_{μ_{k−1}}(x_k) ≤ (1/4)ν_{μ_{k−1}}(x_{k−1}) ≤ (c/4)√μ_{k−1}/R, i.e., x_k ∈ D_{μ_{k−1}}(c/4). 
If q_k is large enough, this implies that x_k ∈ D_{q_k μ_{k−1}}(c) = D_{μ_k}(c), by Lemma 3. Now we are ready to state our main theorem of this section.\n\nProposed Globalization Scheme\n\nStart with x_0 ∈ H, μ_0 > 0, t, T ∈ N and (q_k)_{k∈N} ∈ (0, 1].\nPhase I: getting in the Dikin ellipsoid of f_λ.\nFor k ∈ N:\n  x_{k+1} ← ANM_ρ(f_{μ_k}, x_k, t)\n  μ_{k+1} ← q_{k+1} μ_k\n  Stop when μ_{k+1} < λ and set x_last ← x_k.\nPhase II: reach a certain precision starting from inside the Dikin ellipsoid.\nReturn x̂ ← ANM_ρ(f_λ, x_last, T).\n\nFully adaptive method. The scheme presented above converges with the following parameters.\n\nTheorem 1. Let ε > 0. Set μ_0 = 7R‖∇f(0)‖, x_0 = 0, and perform the globalization scheme above with ρ ≤ 1/7, t = 2, q_k = (1/3 + 7R‖x_k‖)/(1 + 7R‖x_k‖), and T = ⌈log₂ √(1 ∨ (λε⁻¹/R²))⌉. Then, denoting by K the number of steps performed in Phase I, it holds:\n\nf_λ(x̂) − f_λ(x*_λ) ≤ ε,   K ≤ ⌊(3 + 11R‖x*_λ‖) log(7R‖∇f(0)‖/λ)⌋.\n\nNote that the theorem above (proven in Appendix C.3) guarantees a solution with error ε with K steps of ANM, each performing 2 iterations of approximate linear system solving, plus a final run of ANM which performs T iterations of approximate linear system solving. In the case f_i(x) = ℓ_i(w_i⊤x), with ℓ_i : R → R, w_i ∈ H and ℓ_i^(2) ≤ a for some a > 0, the final runtime cost of the proposed scheme to achieve precision ε, when combined with one of the methods for approximate linear system solving from Lemma 1 (i.e. 
sketching), is O(Q² + d) in memory and\n\nO((nd log n + dQ² + Q³)(R‖x*_λ‖ log(R‖∇f(0)‖/λ) + log(λ/(Rε))))\n\nin time, with Q = O(d̄f_λ log(1/(λδ))), where d̄f_λ, defined in Lemma 1, measures the effective dimension of the correlation matrix W⊤W with W = (w_1, . . . , w_n)⊤ ∈ R^{n×d}, corresponding essentially to the number of eigenvalues of W⊤W larger than λ/a. In particular, note that d̄f_λ ≤ min(n, d, rank(W), ab²/λ), with b := max_i ‖w_i‖, and is usually much smaller than these quantities.\n\nRemark 1. The proposed method does not depend on the condition number L/λ of the problem, but on the term R‖x*_λ‖, which can be of the order of R/√λ in the worst case, but is usually much smaller. For example, it is possible to prove that this term is bounded by an absolute constant not depending on λ, if at least one minimizer of f exists. In the appendix (see Proposition 7), we show a variant of this adaptive method which can leverage the regularity of the solution with respect to the Hessian, i.e., depending on the smaller quantity R√λ‖x*_λ‖_{H_λ^{-1}(x*_λ)} instead of R‖x*_λ‖.\n\nFinally, note that it is possible to use q_k = q fixed for all iterations, and much smaller than the one in Thm. 1, depending on some regularity properties of H (see Proposition 8 in Appendix C.2).\n\n4 Application to the non-parametric setting: Kernel methods\n\nIn supervised learning, the goal is to predict well on future data, given the observed training dataset. Let X be the input space and Y ⊆ R^p be the output space. 
We consider a probability distribution P over X × Y generating the data, and the goal is to estimate g* : X → Y solving the problem\n\ng* = arg min_{g : X → Y} L(g),   L(g) = E[ℓ(g(x), y)],   (5)\n\nfor a given loss function ℓ : Y × Y → R. Note that P is not known, and is accessible only via the dataset (x_i, y_i)_{i=1}^n, with n ∈ N, independently sampled from P. A prototypical estimator for g* is the regularized minimizer of the empirical risk L̂(g) = (1/n)Σ_{i=1}^n ℓ(g(x_i), y_i) over a suitable space of functions G. Given φ : X → H, a common choice is to select G as the set of linear functions of φ(x), that is, G = {w⊤φ(·) | w ∈ H}. Then the regularized minimizer of L̂, denoted by ĝ_λ, corresponds to\n\nĝ_λ(x) = ŵ_λ⊤φ(x),   ŵ_λ = arg min_{w ∈ H} (1/n)Σ_{i=1}^n f_i(w) + λ‖w‖²,   f_i(w) = ℓ(w⊤φ(x_i), y_i).   (6)\n\nLearning theory guarantees how fast ĝ_λ converges to the best possible estimator g* with respect to the number of observed examples, in terms of the so-called excess risk L(ĝ_λ) − L(g*). The following theorem recovers the minimax optimal learning rates for the squared loss and extends them to any generalized self-concordant loss function.\n\nNote on df_λ. In this section, we always denote by df_λ the effective dimension of the problem in Eq. (5). When the loss belongs to the family of generalized linear models (see Example 1) and the model is well-specified, then df_λ is defined exactly as in Eq. (1); otherwise we need a more refined definition (see [23] or Eq. (30) in Appendix D).\n\nTheorem 2 (from [23], Thm. 4). Let λ > 0, δ ∈ (0, 1]. Let ℓ be generalized self-concordant with parameter R > 0 and sup_{x ∈ X} ‖φ(x)‖ ≤ C < ∞. 
Assume that there exists g* minimizing L. Then there exists c_0 not depending on n, λ, δ, df_λ, C, g*, such that if √(df_λ/n), b_λ ≤ λ^{1/2}/R and n ≥ C/λ log(δ⁻¹C/λ), the following holds with probability 1 − δ:\n\nL(ĝ_λ) − L(g*) ≤ c_0 (df_λ/n + b_λ²) log(1/δ),   b_λ := λ‖g*‖_{H_λ^{-1}(g*)}.   (7)\n\nUnder standard regularity assumptions on the learning problem [23], i.e., (a) the capacity condition σ_j(H(g*)) ≤ Cj^{−α}, for α ≥ 1, C > 0 (i.e., a decay of the eigenvalues σ_j(H(g*)) of the Hessian at the optimum), and (b) the source condition g* = H(g*)^r v, with v ∈ H and r > 0 (i.e., the control of the optimal g* in a specific Hessian-dependent norm), we have df_λ ≤ C′λ^{−1/α} and b_λ² ≤ C′′λ^{1+2r}, leading to the following optimal learning rate:\n\nL(ĝ_λ) − L(g*) ≤ c_1 n^{−(1+2rα)/(1+α+2rα)} log(1/δ),   when λ = n^{−α/(1+α+2rα)}.   (8)\n\nNow we propose an algorithmic scheme to compute efficiently an approximation of ĝ_λ that achieves the same optimal learning rates. First we need to introduce the technique we are going to use.\n\nNyström projection. It consists in suitably selecting {x̄_1, . . . , x̄_M} ⊂ {x_1, . . . , x_n}, with M ≪ n, and computing ḡ_{M,λ}, i.e., the solution of Eq. (6) over H_M = span{φ(x̄_1), . . . 
, \u03c6(\u00afxM )} instead of H.\nIn this case the problem can be reformulated as a problem in RM as\n\n1\nn\n\n\u00affi(\u03b1) + \u03bb(cid:107)\u03b1(cid:107)2,\n\n\u00afgM,\u03bb = \u00af\u03b1(cid:62)\n\nM,\u03bbT\u22121v(x),\n\n\u00aff\u03bb(\u03b1),\n\n\u00aff (\u03b1) =\n\ni=1\n\n\u00af\u03b1M,\u03bb = arg min\n\u03b1\u2208RM\n\n(9)\nwhere \u00affi(\u03b1) = (cid:96)(v(xi)(cid:62)T\u22121\u03b1, yi) and v(x) \u2208 RM , v(x) = (k(x, \u00afx1), . . . , k(x, \u00afxM )) with\nk(x, x(cid:48)) = \u03c6(x)(cid:62)\u03c6(x(cid:48)) the associated positive-de\ufb01nite kernel [32], while T is the upper trian-\ngular matrix such that K = T(cid:62)T, with K \u2208 RM\u00d7M with Kij = k(\u00afxi, \u00afxj). In the next theorem\nwe characterize the suf\ufb01cient M to achieve minimax optimal rates, for two standard techniques of\nchoosing the Nystr\u00f6m points {\u00afx1, . . . , \u00afxM}.\nTheorem 3 (Optimal rates for learning with Nystr\u00f6m). Let \u03bb > 0, \u03b4 \u2208 (0, 1]. Assume the conditions\nof Thm. 2. Then the excess risk of \u00afgM,\u03bb is bounded with prob. 1 \u2212 2\u03b4 as in Eq. (7) (with c(cid:48)\n1 \u221d c1),\nwhen\n\n(1) Uniform Nystr\u00f6m method [28, 29] is used and M \u2265 C1/\u03bb log(C2/\u03bb\u03b4).\n(2) Approximate leverage score method [3, 28, 29] is used and M \u2265 C3 df\u03bb log(C4/\u03bb\u03b4).\n\nHere C, C1, C2, C4 do not depend on \u03bb, n, M, df\u03bb, \u03b4.\n\nn(cid:88)\n\n7\n\n\fThm. 3 generalizes results for learning with Nystr\u00f6m and squared loss [28], to GSC losses. It is\nproved in Thm. 6, in Appendix D.4. As in [28], Thm. 3 shows that Nystr\u00f6m is a valid technique\nfor dimensionality reduction. Indeed it is essentially possible to project the learning problem on a\nsubspace HM of dimension M = O(c/\u03bb) or even as small as M = O(df\u03bb) and still achieve the\noptimal rates of Thm. 2. Now we are ready to introduce our algorithm.\n\nProposed algorithm. 
The algorithm conceptually consists in (a) performing a projection step with Nyström, and (b) solving the resulting optimization problem with the globalization scheme proposed in Sec. 3 based on ANM in Eq. (3). In particular, we want to avoid applying T^{−1} explicitly to each v(x_i) in Eq. (9), which would require O(nM²) time. We therefore use the following approximation technique, based only on matrix-vector products, so that we can apply T^{−1} to α just once at each iteration, with a total cost proportional only to O(nM + M²) per iteration. Given α and ∇f̄_λ(α), we approximate z* = H̄_λ(α)^{−1}∇f̄_λ(α), where H̄_λ is the Hessian of f̄_λ(α), with z̃ defined as

    z̃ = prec-conj-grad_t(H̄_λ(α), ∇f̄_λ(α)),

where prec-conj-grad_t corresponds to performing t steps of preconditioned conjugate gradient [19], with a preconditioner computed using one of the Hessian subsampling approaches presented in Sec. 2, in the paragraph starting with Eq. (4). The pseudocode for the whole procedure is presented in Alg. 1, Appendix E. This technique of approximate linear-system solving has been studied in [29] in the context of empirical risk minimization for the squared loss.

Lemma 4 ([29]). Let λ > 0 and α, b ∈ R^M. The previous method, applied with t = O(log 1/ρ), outputs an element of LinApprox(H̄_λ(α), b, ρ) with probability 1 − δ, with complexity O((nM + M²Q + M³ + c)t) in time and O(M² + n) in space, with Q = O(C₁/λ log(C₁/λδ)), c = O(1) if uniform sub-sampling is used, or Q = O(C₂ df_λ log(C₁/λδ)), c = O(df_λ² min(n, 1/λ)) if sub-sampling with leverage scores is used [30].

A more complete version of this lemma is shown in Proposition 12 in Appendix D.5.1.
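As a concrete illustration of this step, the sketch below approximates a Newton direction H_λ(w)⁻¹∇f_λ(w) with preconditioned conjugate gradient, using a matrix-free Hessian-vector product and a preconditioner built from a uniform subsample of size Q. This is a minimal sketch under our own assumptions (a plain logistic loss in R^d, NumPy/SciPy, and helper names `sigmoid` and `newton_direction` of our choosing), not the exact routine of Alg. 1.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_direction(X, y, w, lam, Q=256, t=20, seed=0):
    """Approximate H_lam(w)^{-1} grad f_lam(w) via preconditioned CG.

    Illustrative sketch: logistic loss, uniform Hessian subsampling of
    size Q for the preconditioner, at most t conjugate-gradient steps.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n + 2.0 * lam * w      # gradient of the regularized risk
    D = p * (1.0 - p)                             # per-sample curvature of the logistic loss

    # Matrix-free Hessian-vector product: H v = (1/n) X^T diag(D) X v + 2 lam v
    H = LinearOperator(
        (d, d),
        matvec=lambda v: X.T @ (D * (X @ np.ravel(v))) / n + 2.0 * lam * np.ravel(v),
    )

    # Preconditioner: solve against a Hessian built from Q uniformly
    # subsampled points (cheap when Q << n and d is moderate)
    idx = rng.choice(n, size=min(Q, n), replace=False)
    Xs, Ds = X[idx], D[idx]
    Hs = Xs.T @ (Ds[:, None] * Xs) / len(idx) + 2.0 * lam * np.eye(d)
    M = LinearOperator((d, d), matvec=lambda v: np.linalg.solve(Hs, np.ravel(v)))

    z, _ = cg(H, grad, M=M, maxiter=t)            # t preconditioned CG steps
    return z
```

Each call costs roughly O(nd) per CG step plus O(Qd² + d³) for the preconditioner, mirroring the O((nM + M²Q + M³)t) count of Lemma 4 in the projected problem.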
We conclude this section with a result establishing the learning properties of the proposed algorithm.

Theorem 4 (Optimal rates for the proposed algorithm). Let λ > 0 and ε < λ/R². Under the hypotheses of Thm. 3, if we set M as in Thm. 3 and Q as in Lemma 4, and set the globalization scheme as in Thm. 1, then the proposed algorithm (Alg. 1, Appendix E) finishes in a finite number of Newton steps N_ns = O(R‖g*‖ log(C/λ) + log(C/ε)) and returns a predictor g_{Q,M,λ} of the form g_{Q,M,λ}(x) = α^⊤T^{−1}v(x). With probability at least 1 − δ, this predictor satisfies:

    L(g_{Q,M,λ}) − L(g*) ≤ c₀ (df_λ/n + b_λ² + ε) log(1/δ),   b_λ := λ‖g*‖_{H_λ^{-1}(g*)}.   (10)

The theorem above (see Proposition 14, Appendix D.6 for exact quantifications) shows that the proposed algorithm achieves the same learning rates as plain empirical risk minimization in Thm. 2. The total complexity of the procedure, including the cost of computing the preconditioner, the selection of the Nyström points via approximate leverage scores, and the computation of the leverage scores themselves [30], is then

    O(R‖g*‖ log(R²/λ) (n df_λ log(Cλ⁻¹δ⁻¹) c_X + df_λ³ log³(Cλ⁻¹δ⁻¹) + min(n, C/λ) df_λ²))

in time and O(df_λ² log²(Cλ⁻¹δ⁻¹)) in space, where c_X is the cost of computing the inner product k(x, x′) (in the kernel setting assumed here, when the input space is X = R^p, we have c_X = O(p)). As noted in [30], under the standard regularity assumptions on the learning problem seen above, df_λ² ≤ df_λ/λ ≤ n when the optimal λ is chosen.
So the total computational complexity is O(R log(R²/λ) log³(Cλ⁻¹δ⁻¹) ‖g*‖ · n · df_λ · c_X) in time and O(df_λ² · log²(Cλ⁻¹δ⁻¹)) in space. First, note that due to the statistical properties of the problem, the complexity does not depend even implicitly on √(C/λ), but only on log(C/λ), so the algorithm runs in essentially O(n df_λ) time, compared to the O(df_λ √(nC/λ)) of the accelerated first-order methods we develop in Appendix F and the O(n df_λ √(C/λ)) of other Newton schemes (see Sec. 1.1). To our knowledge, this is the first algorithm to achieve optimal statistical learning rates for generalized self-concordant losses with complexity only Õ(n df_λ). This generalizes similar results for the squared loss [29, 30].

Figure 1: Training loss and test error as a function of the number of passes over the data for our algorithm vs. K-SVRG on the (left) Susy and (right) Higgs data sets.

5 Experiments

The code necessary to reproduce the following experiments is available on GitHub at https://github.com/umarteau/Newton-Method-for-GSC-losses-.

We compared the performance of our algorithm for kernel logistic regression on two large-scale classification datasets (n ≈ 10⁷), Higgs and Susy, pre-processed as in [29]. We implemented the algorithm in PyTorch and performed the computations on one Tesla P100-PCIE-16GB GPU. For Susy (n = 5 × 10⁶, p = 18), we used the Gaussian kernel k(x, x′) = e^{−‖x−x′‖²/(2σ²)} with σ = 5, obtained through a grid search (in [29], σ = 4 is taken for ridge regression); we used M = 10⁴ Nyström centers and a subsampling level Q = M for the preconditioner, both obtained with uniform sampling.
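To make the projection step of the experiments concrete, here is a minimal NumPy/SciPy sketch of the Nyström embedding x ↦ v(x)^⊤T^{−1} from Eq. (9), with the Gaussian kernel used above, uniformly sampled centers, and the upper-triangular T such that K = T^⊤T. The helper names and the small jitter added before the Cholesky factorization are our own illustrative choices, not part of Alg. 1.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gaussian_kernel(A, B, sigma=5.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed pairwise."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def nystrom_embed(X, centers, sigma=5.0, jitter=1e-10):
    """Map each row x of X to v(x)^T T^{-1}, where K = T^T T on the centers.

    Linear models on these M-dimensional features then approximate kernel
    models on the full data (illustrative sketch; jitter is for stability).
    """
    M = len(centers)
    K = gaussian_kernel(centers, centers, sigma)
    T = cholesky(K + jitter * np.eye(M), lower=False)  # upper triangular, K = T^T T
    V = gaussian_kernel(X, centers, sigma)             # row i is v(x_i)^T
    # v(x)^T T^{-1} for every row: solve T^T Z^T = V^T, i.e. Z = V T^{-1}
    return solve_triangular(T, V.T, trans='T', lower=False).T
```

A linear model trained on the returned M-dimensional features then plays the role of α in Eq. (9).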
Analogously for Higgs (n = 1.1 × 10⁷, p = 28), we used a Gaussian kernel with σ = 5, M = 2.5 × 10⁴ and Q = M, again with uniform sampling. To find a reasonable λ for supervised learning applications, we cross-validated λ, finding the minimum test error at λ = 10⁻¹⁰ for Susy and λ = 10⁻⁹ for Higgs (see Figs. 2 and 3 in Appendix F); for such values, our algorithm and the competitor achieve an error of 19.5% on the Susy test set, comparable to the state of the art (19.6% [29]), and analogously for Higgs (see Appendix F). We then used these λ's as regularization parameters and compared our algorithm with a well-known accelerated stochastic gradient technique, Katyusha SVRG (K-SVRG) [4], tailored to our problem using mini-batches. In Fig. 1 we show the convergence of the training loss and classification error with respect to the number of passes over the data for our algorithm compared to K-SVRG. Our algorithm is orders of magnitude faster in achieving convergence, validating empirically the fact that the proposed algorithm scales as O(n df_λ) in learning settings, while accelerated first-order methods go as O((n + √(nL/λ)) df_λ). Moreover, as mentioned in the introduction, this highlights the fact that precise optimization is necessary to achieve good performance in terms of test error. Finally, note that since a pass over the data is much more expensive for K-SVRG than for our second-order method (see Appendix F for details), the difference in computing time is even more in favour of our second-order method (see Figs. 4 and 5 in Appendix F).

Acknowledgments

We acknowledge support from the European Research Council (grant SEQUOIA 724063).

References

[1] Murat A. Erdogdu and Andrea Montanari.
Convergence rates of sub-sampled Newton methods. Technical Report 1508.02810, arXiv, 2015.

[2] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(1):4148–4187, January 2017.

[3] Ahmed Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775–783, 2015.

[4] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the Symposium on Theory of Computing, pages 1200–1205, 2017.

[5] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[6] Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.

[7] Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.

[8] Raghu Bollapragada, Richard H. Byrd, and Jorge Nocedal. Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis, 39(2):545–578, 2018.

[9] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

[10] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[11] Christos Boutsidis and Alex Gittens.
Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340, 2013.

[12] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[13] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, July 2007.

[14] Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pages 676–684, 2016.

[15] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[16] Peter Deuflhard. Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms. Springer, 2011.

[17] Petros Drineas, Michael W. Mahoney, Shan Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.

[18] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.

[19] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.

[20] Robert Gower, Filip Hanzely, Peter Richtárik, and Sebastian U. Stich. Accelerated stochastic matrix inversion: general theory and speeding up BFGS rules for faster second-order optimization. In Advances in Neural Information Processing Systems, pages 1619–1629, 2018.

[21] Sai Praneeth Karimireddy, Sebastian U. Stich, and Martin Jaggi. Global linear convergence of Newton's method without strong convexity or Lipschitz gradients.
CoRR, abs/1806.00413, 2018. URL http://arxiv.org/abs/1806.00413.

[22] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[23] Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. In Proceedings of the Conference on Computational Learning Theory, 2019.

[24] Arkadii Nemirovskii and Yurii Nesterov. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.

[25] Mert Pilanci and Martin J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

[26] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[27] Farbod Roosta-Khorasani and Michael W. Mahoney. Sub-sampled Newton methods. Mathematical Programming, 174(1-2):293–326, 2019.

[28] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems 28, pages 1657–1665, 2015.

[29] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems 30, pages 3888–3898, 2017.

[30] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems, pages 5672–5682, 2018.

[31] Y. Saad. Iterative Methods for Sparse Linear Systems.
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003.

[32] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[33] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2012.

[34] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.