{"title": "Sub-sampled Newton Methods with Non-uniform Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 3000, "page_last": 3008, "abstract": "We consider the problem of finding the minimizer of a convex function $F: \\mathbb R^d \\rightarrow \\mathbb R$ of the form $F(w) := \\sum_{i=1}^n f_i(w) + R(w)$ where a low-rank factorization of $\\nabla^2 f_i(w)$ is readily available. We consider the regime where $n \\gg d$. We propose randomized Newton-type algorithms that exploit \\textit{non-uniform} sub-sampling of $\\{\\nabla^2 f_i(w)\\}_{i=1}^{n}$, as well as inexact updates, as means to reduce the computational complexity, and are applicable to a wide range of problems in machine learning. Two non-uniform sampling distributions based on {\\it block norm squares} and {\\it block partial leverage scores} are considered. Under certain assumptions, we show that our algorithms inherit a linear-quadratic convergence rate in $w$ and achieve a lower computational complexity compared to similar existing methods. In addition, we show that our algorithms exhibit more robustness and better dependence on problem specific quantities, such as the condition number. We numerically demonstrate the advantages of our algorithms on several real datasets.", "full_text": "Sub-sampled Newton Methods with Non-uniform Sampling\n\nPeng Xu\u2020 Jiyan Yang\u2020 Farbod Roosta-Khorasani\u2021 Christopher R\u00e9\u2020 Michael W. Mahoney\u2021\n\n\u2020 Stanford University\n\n\u2021 University of California at Berkeley\n\npengxu@stanford.edu jiyan@stanford.edu farbod@icsi.berkeley.edu chrismre@cs.stanford.edu mmahoney@stat.berkeley.edu\n\nAbstract\n\nWe consider the problem of finding the minimizer of a convex function $F: \\mathbb{R}^d \\rightarrow \\mathbb{R}$ of the form $F(w) := \\sum_{i=1}^n f_i(w) + R(w)$ where a low-rank factorization of $\\nabla^2 f_i(w)$ is readily available. We consider the regime where $n \\gg d$. 
We propose randomized Newton-type algorithms that exploit non-uniform sub-sampling of $\\{\\nabla^2 f_i(w)\\}_{i=1}^n$, as well as inexact updates, as means to reduce the computational complexity, and are applicable to a wide range of problems in machine learning. Two non-uniform sampling distributions based on block norm squares and block partial leverage scores are considered. Under certain assumptions, we show that our algorithms inherit a linear-quadratic convergence rate in $w$ and achieve a lower computational complexity compared to similar existing methods. In addition, we show that our algorithms exhibit more robustness and better dependence on problem specific quantities, such as the condition number. We empirically demonstrate that our methods are at least twice as fast as Newton\u2019s method on several real datasets.\n\n1 Introduction\n\nMany machine learning applications involve finding the minimizer of optimization problems of the form\n\n$$\\min_{w \\in \\mathcal{C}} F(w) := \\sum_{i=1}^n f_i(w) + R(w) \\tag{1}$$\n\nwhere $f_i(w)$ is a smooth convex function, $R(w)$ is a regularizer, and $\\mathcal{C} \\subseteq \\mathbb{R}^d$ is a convex constraint set (e.g., $\\ell_1$ ball). Examples include sparse least squares [21], generalized linear models (GLMs) [8], and metric learning problems [12].\n\nFirst-order optimization algorithms have been the workhorse of machine learning applications and there is a plethora of such methods [3, 17] for solving (1). However, for ill-conditioned problems, it is often the case that first-order methods return a solution far from $w^*$ albeit with a low objective value. On the other hand, most second-order algorithms prove to be more robust to such adversarial effects. This is so since, using the curvature information, second-order methods properly rescale the gradient, such that it is a more appropriate direction to follow. 
For example, take the canonical second-order method, i.e., Newton\u2019s method, which, in the unconstrained case, has updates of the form $w_{t+1} = w_t - [H(w_t)]^{-1} g(w_t)$ (here, $g(w_t)$ and $H(w_t)$ denote the gradient and the Hessian of $F$ at $w_t$, respectively). Classical results indicate that under certain assumptions, Newton\u2019s method can achieve a locally super-linear convergence rate, which can be shown to be problem independent! Nevertheless, the cost of forming and inverting the Hessian is a major drawback in using Newton\u2019s method in practice. In this regard, there has been a long line of work aiming at providing sufficient second-order information more efficiently, e.g., the classical BFGS algorithm and its limited memory version [14, 17].\n\nAs the cost of merely evaluating $H(w)$ grows linearly in $n$, a natural idea is to use uniform sub-sampling of $\\{\\nabla^2 f_i(w)\\}_{i=1}^n$ as a way to reduce the cost of such evaluation [7, 19, 20]. However, in the presence of high non-uniformity among $\\{\\nabla^2 f_i(w)\\}_{i=1}^n$, the sampling size required to sufficiently capture the curvature information of the Hessian can be very large. In such situations, non-uniform sampling can indeed be a much better alternative and is addressed in this work in detail.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this work, we propose novel, robust and highly efficient non-uniformly sub-sampled Newton methods (SSN) for a large sub-class of problem (1), where the Hessian of $F(w)$ in (1) can be written as $H(w) = \\sum_{i=1}^n A_i^T(w) A_i(w) + Q(w)$, where $A_i(w) \\in \\mathbb{R}^{k_i \\times d}$, $i = 1, 2, \\ldots, n$, are readily available and $Q(w)$ is some positive semi-definite matrix. This situation arises very frequently in machine learning problems. For example, take any problem where $f_i(w) = \\ell(x_i^T w)$, $\\ell(\\cdot)$ is any convex loss function and $x_i$\u2019s are data points. In such situations, $A_i(w)$ is simply $\\sqrt{\\ell''(x_i^T w)} x_i^T$. Under this setting, non-uniformly sub-sampling the Hessians now boils down to building an appropriate non-uniform distribution to sub-sample the most \u201crelevant\u201d terms among $\\{A_i(w)\\}_{i=1}^n$. The approximate Hessian, denoted by $\\tilde{H}(w_t)$, is then used to update the current iterate as $w_{t+1} = w_t - [\\tilde{H}(w_t)]^{-1} g(w_t)$. Furthermore, in order to improve upon the overall efficiency of our SSN algorithms, we will allow for the linear system in the sub-problem to be solved inexactly, i.e., using only a few iterations of any iterative solver such as Conjugate Gradient (CG). Such inexact updates used in many second-order optimization algorithms have been well studied in [4, 5].\n\nAs we shall see (in Section 4), our algorithms converge much faster than other competing methods for a variety of problems. In particular, on several machine learning datasets, our methods are at least twice as fast as Newton\u2019s method in finding a high-precision solution while other methods converge slowly. Indeed, this phenomenon is well supported by our theoretical findings: the complexity of our algorithms has a lower dependence on the problem condition number and is immune to any non-uniformity among $\\{A_i(w)\\}_{i=1}^n$ which may cause a factor of $n$ in the complexity (Table 1). In the following we present details of our main contributions and connections to other prior work. Readers interested in more details should see the technical report version of this conference paper [23] for proofs of our main results, additional theoretical results, as well as a more detailed empirical evaluation.\n\n1.1 Contributions and related work\n\nRecently, within the context of randomized second-order methods, many algorithms have been proposed that aim at reducing the computational costs involving pure Newton\u2019s method. 
Among them, algorithms that employ uniform sub-sampling constitute a popular line of work [4, 7, 16, 22]. In particular, [19, 20] consider a more general class of problems and, under a variety of conditions, thoroughly study the local and global convergence properties of sub-sampled Newton methods where the gradient and/or the Hessian are uniformly sub-sampled. Our work here, however, is more closely related to a recent work [18] (Newton Sketch), which considers a similar class of problems and proposes sketching the Hessian using random sub-Gaussian matrices or randomized orthonormal systems. Furthermore, [1] proposes a stochastic algorithm (LiSSA) that, for solving the sub-problems, employs some unbiased estimators of the inverse of the Hessian.\n\nIn light of these prior works, our contributions can be summarized as follows.\n\u2022 For the class of problems considered here, unlike the uniform sampling used in [4, 7, 19, 20], we employ two non-uniform sampling schemes based on block norm squares and a new, and more general, notion of leverage scores named block partial leverage scores (Definition 1). It can be shown that in the case of extreme non-uniformity among $\\{A_i(w)\\}_{i=1}^n$, uniform sampling might require $\\Omega(n)$ samples to capture the Hessian information appropriately. However, we show that our non-uniform sampling schemes result in sample sizes completely independent of $n$ and immune to such non-uniformity.\n\u2022 Within the context of globally convergent randomized second-order algorithms, [4, 20] incorporate inexact updates where the sub-problems are solved only approximately. We extend the study of inexactness to our local convergence analysis.\n\u2022 We provide a general structural result (Lemma 2) showing that, as in [7, 18, 19], our main algorithm exhibits a linear-quadratic solution error recursion. 
However, we show that by using our non-uniform sampling strategies, the factors appearing in such error recursion enjoy a much better dependence on problem specific quantities such as the condition number (Table 2). For example, using block partial leverage score sampling, the factor for the linear term of the error recursion (5) is of order $O(\\sqrt{\\kappa})$ as opposed to $O(\\kappa)$ for uniform sampling.\n\u2022 We demonstrate that to achieve a locally problem independent linear convergence rate, i.e., $\\|w_{t+1} - w^*\\| \\leq \\rho \\|w_t - w^*\\|$ for some fixed $\\rho < 1$, our algorithms achieve a lower per-iteration complexity compared to [1, 18, 20] (Table 1). In particular, unlike Newton Sketch [18], which employs random projections and fails to preserve the sparsity of $\\{A_i(w)\\}_{i=1}^n$, our methods indeed take advantage of such sparsity. Also, in the presence of high non-uniformity among $\\{A_i(w)\\}_{i=1}^n$, factors $\\bar{\\kappa}$ and $\\hat{\\kappa}$ (see Definition (6)) which appear in SSN (uniform) [19], and LiSSA [1], can potentially be as large as $\\Omega(n\\kappa)$; see Section 3.5 for detailed discussions.\n\u2022 We numerically demonstrate the effectiveness and robustness of our algorithms in recovering the minimizer of ridge logistic regression on several real datasets (Figures 1 and 2). In particular, our algorithms are at least twice as fast as Newton\u2019s method in finding a high-precision solution while other methods converge slowly.\n\nTable 1: Complexity per iteration of different methods to obtain a problem independent local linear convergence rate. The quantities $\\kappa$, $\\hat{\\kappa}$, and $\\bar{\\kappa}$ are the local condition numbers, defined in (6), satisfying $\\kappa \\leq \\hat{\\kappa} \\leq \\bar{\\kappa}$, at the optimum $w^*$. $A$ is defined in Assumption A.3 and $\\mathrm{sr}(A)$ is the stable rank of $A$ satisfying $\\mathrm{sr}(A) \\leq d$. Here we assume $k_i = 1$, $\\mathcal{C} = \\mathbb{R}^d$, $R(w) = 0$, and CG is used for solving sub-problems in our algorithms.\n\nNAME | COMPLEXITY PER ITERATION | REFERENCE\nNewton-CG method | $\\tilde{O}(\\mathrm{nnz}(A) \\sqrt{\\kappa})$ | [17]\nSSN (leverage scores) | $\\tilde{O}(\\mathrm{nnz}(A) \\log n + d^2 \\kappa^{3/2})$ | This paper\nSSN (row norm squares) | $\\tilde{O}(\\mathrm{nnz}(A) + \\mathrm{sr}(A) d \\kappa^{5/2})$ | This paper\nNewton Sketch (SRHT) | $\\tilde{O}(nd (\\log n)^4 + d^2 (\\log n)^4 \\kappa^{3/2})$ | [18]\nSSN (uniform) | $\\tilde{O}(\\mathrm{nnz}(A) + d \\hat{\\kappa} \\kappa^{3/2})$ | [20]\nLiSSA | $\\tilde{O}(\\mathrm{nnz}(A) + d \\hat{\\kappa} \\bar{\\kappa}^2)$ | [1]\n\n1.2 Notation and assumptions\n\nGiven a function $F$, the gradient, the exact Hessian and the approximate Hessian are denoted by $g$, $H$, and $\\tilde{H}$, respectively. The iteration counter is denoted by a subscript, e.g., $w_t$. Unless stated specifically, $\\|\\cdot\\|$ denotes the Euclidean norm for vectors and spectral norm for matrices. The Frobenius norm of matrices is written as $\\|\\cdot\\|_F$. By a matrix $A$ having $n$ blocks, we mean that $A$ has a block structure and can be viewed as $A = (A_1^T \\cdots A_n^T)^T$, for appropriate size blocks $A_i$. The tangent cone of the constraint set $\\mathcal{C}$ at the optimum $w^*$ is denoted by $\\mathcal{K}$ and defined as $\\mathcal{K} = \\{\\Delta \\mid w^* + t\\Delta \\in \\mathcal{C} \\text{ for some } t > 0\\}$. Given a symmetric matrix $A$, the $\\mathcal{K}$-restricted minimum and maximum eigenvalues of $A$ are defined, respectively, as $\\lambda_{\\min}^{\\mathcal{K}}(A) = \\min_{x \\in \\mathcal{K} \\setminus \\{0\\}} x^T A x / x^T x$ and $\\lambda_{\\max}^{\\mathcal{K}}(A) = \\max_{x \\in \\mathcal{K} \\setminus \\{0\\}} x^T A x / x^T x$. The stable rank of a matrix $A$ is defined as $\\mathrm{sr}(A) = \\|A\\|_F^2 / \\|A\\|_2^2$. We use $\\mathrm{nnz}(A)$ to denote the number of non-zero elements in $A$.\n\nThroughout the paper, we make use of the following assumptions:\nA.1 Lipschitz Continuity: $F(w)$ is convex and twice differentiable with $L$-Lipschitz Hessian, i.e., $\\|H(u) - H(v)\\| \\leq L \\|u - v\\|$, $\\forall u, v \\in \\mathcal{C}$.\nA.2 Local Regularity: $F(w)$ is locally strongly convex and smooth, i.e., $\\mu = \\lambda_{\\min}^{\\mathcal{K}}(H(w^*)) > 0$, $\\nu = \\lambda_{\\max}^{\\mathcal{K}}(H(w^*)) < \\infty$. Here we define the local condition number of the problem as $\\kappa := \\nu/\\mu$.\nA.3 Hessian Decomposition: For each $f_i(w)$ in (1), define $\\nabla^2 f_i(w) := H_i(w) := A_i^T(w) A_i(w)$. For simplicity, we assume $k_1 = \\cdots = k_n = k$ and $k$ is independent of $d$. Furthermore, we assume that given $w$, computing $A_i(w)$, $H_i(w)$, and $g(w)$ takes $O(d)$, $O(d^2)$, and $O(\\mathrm{nnz}(A))$ time, respectively. We call the matrix $A(w) = (A_1^T, \\ldots, A_n^T)^T \\in \\mathbb{R}^{nk \\times d}$ the augmented matrix of $\\{A_i(w)\\}$. Note that $H(w) = A(w)^T A(w) + Q(w)$.\n\n2 Main Algorithm: SSN with Non-uniform Sampling\n\nOur proposed SSN method with non-uniform sampling is given in Algorithm 1. The core of our algorithm is based on choosing a sampling scheme $\\mathcal{S}$ that, at every iteration, constructs a non-uniform sampling distribution $\\{p_i\\}_{i=1}^n$ over $\\{A_i(w_t)\\}_{i=1}^n$ and then samples from $\\{A_i(w_t)\\}_{i=1}^n$ to form the approximate Hessian, $\\tilde{H}(w_t)$. The sampling sizes $s$ needed for different sampling distributions will be discussed in Section 3.2. Since $H(w) = \\sum_{i=1}^n A_i^T(w) A_i(w) + Q(w)$, the Hessian approximation essentially boils down to a matrix approximation problem. Here, we generalize the two popular non-uniform sampling strategies, i.e., leverage score sampling and row norm squares sampling, which are commonly used in the field of randomized linear algebra, particularly for matrix approximation problems [10, 15]. With an approximate Hessian constructed via non-uniform sampling, we may choose an appropriate solver $\\mathcal{A}$ to solve the sub-problem in Step 11 of Algorithm 1. Below we elaborate on the construction of the two non-uniform sampling schemes.\n\nBlock Norm Squares Sampling. This is done by constructing a sampling distribution based on the Frobenius norm of the blocks $A_i$, i.e., $p_i = \\|A_i\\|_F^2 / \\|A\\|_F^2$, $i = 1, \\ldots, n$. This is an extension to the row norm squares sampling in which the intuition is to capture the importance of the blocks based on the \u201cmagnitudes\u201d of the sub-Hessians [10].\n\nBlock Partial Leverage Scores Sampling. Recall that the standard leverage scores of a matrix $A$ are defined as the diagonal elements of the \u201chat\u201d matrix $A(A^T A)^{-1} A^T$ [15], which prove to be very useful in matrix approximation algorithms. However, in contrast to the standard case, there are two major differences in our task. First, blocks, not rows, are being sampled. Second, an additional matrix $Q$ is involved in the target matrix, i.e., $H$. In light of this, we introduce a new and more general notion of leverage scores, called block partial leverage scores.\n\nDefinition 1 (Block Partial Leverage Scores). Given a matrix $A \\in \\mathbb{R}^{kn \\times d}$ viewed as having $n$ blocks of size $k \\times d$ and a SPSD matrix $Q \\in \\mathbb{R}^{d \\times d}$, let $\\{\\tau_i\\}_{i=1}^{kn+d}$ be the (standard) leverage scores of the augmented matrix $\\begin{pmatrix} A \\\\\\\\ Q^{1/2} \\end{pmatrix}$. The block partial leverage score for the $i$-th block is defined as $\\tau_i^Q(A) = \\sum_{j=k(i-1)+1}^{ki} \\tau_j$.\n\nNote that for $k = 1$ and $Q = 0$, the block partial leverage score is simply the standard leverage score. The sampling distribution is defined as $p_i = \\tau_i^Q(A) / (\\sum_{j=1}^n \\tau_j^Q(A))$, $i = 1, \\ldots, n$.\n\nAlgorithm 1 Sub-sampled Newton method with Non-uniform Sampling\n1: Input: Initialization point $w_0$, number of iterations $T$, sampling scheme $\\mathcal{S}$ and solver $\\mathcal{A}$.\n2: Output: $w_T$\n3: for $t = 0, \\ldots, T - 1$ do\n4: Construct the non-uniform sampling distribution $\\{p_i\\}_{i=1}^n$ as described in Section 2.\n5: for $i = 1, \\ldots, n$ do\n6: $q_i = \\min\\{s \\cdot p_i, 1\\}$, where $s$ is the sampling size.\n7: $\\tilde{A}_i(w_t) = A_i(w_t)/\\sqrt{q_i}$ with probability $q_i$, and $\\tilde{A}_i(w_t) = 0$ with probability $1 - q_i$.\n8: end for\n9: $\\tilde{H}(w_t) = \\sum_{i=1}^n \\tilde{A}_i^T(w_t) \\tilde{A}_i(w_t) + Q(w_t)$.\n10: Compute $g(w_t)$\n11: Use solver $\\mathcal{A}$ to solve the sub-problem inexactly: $w_{t+1} \\approx \\arg\\min_{w \\in \\mathcal{C}} \\{ \\frac{1}{2} \\langle (w - w_t), \\tilde{H}(w_t)(w - w_t) \\rangle + \\langle g(w_t), w - w_t \\rangle \\}$. (2)\n12: end for\n13: return $w_T$.\n\n3 Theoretical Results\n\nIn this section we provide detailed complexity analysis of our algorithm.^1 Different choices of sampling scheme $\\mathcal{S}$ and the sub-problem solver $\\mathcal{A}$ lead to different complexities in SSN. More precisely, total complexity is characterized by the following four factors: (i) total number of iterations $T$ determined by the convergence rate which is affected by the choice of $\\mathcal{S}$ and $\\mathcal{A}$; see Lemma 2 in Section 3.1, (ii) the time, $t_{\\mathrm{grad}}$, it takes to compute the full gradient $g(w_t)$ (Step 10 in Algorithm 1), (iii) the time $t_{\\mathrm{const}}$, to construct the sampling distribution $\\{p_i\\}_{i=1}^n$ and sample $s$ terms at each iteration (Steps 4-8 in Algorithm 1), which is determined by $\\mathcal{S}$; see Section 3.2 for details, and (iv) the time $t_{\\mathrm{solve}}$ needed to (implicitly) form $\\tilde{H}$ and (inexactly) solve the sub-problem at each iteration (Steps 9 and 11 in Algorithm 1) which is affected by the choices of both $\\mathcal{S}$ (manifested in the sampling size $s$) and $\\mathcal{A}$; see Sections 3.2 and 3.3 for details. With these, the total complexity can be expressed as\n\n$$T \\cdot (t_{\\mathrm{grad}} + t_{\\mathrm{const}} + t_{\\mathrm{solve}}). \\tag{3}$$\n\n^1 In this work, we only focus on local convergence guarantees for Algorithm 1. To ensure global convergence, one can incorporate an existing globally convergent method, e.g. 
[20], as initial phase and switch to Algorithm 1 once the iterate is \u201cclose enough\u201d to the optimum; see Lemma 2.\n\nBelow we study these contributing factors. Moreover, the per iteration complexity of our algorithm for achieving a problem independent linear convergence rate is presented in Section 3.4 and comparison to other related work is discussed in Section 3.5.\n\n3.1 Local linear-quadratic error recursion\n\nBefore diving into details of the complexity analysis, we state a structural lemma that characterizes the local convergence rate of our main algorithm, i.e., Algorithm 1. As discussed earlier, there are two layers of approximation in Algorithm 1, i.e., approximation of the Hessian by sub-sampling and inexactness of solving (2). For the first layer, we require the approximate Hessian to satisfy one of the following two conditions (in Section 3.2 we shall see our construction of approximate Hessian via non-uniform sampling can achieve these conditions with a sampling size independent of $n$).\n\n$$\\|\\tilde{H}(w_t) - H(w_t)\\| \\leq \\epsilon \\cdot \\|H(w_t)\\|, \\tag{C1}$$\n\nor\n\n$$|x^T (\\tilde{H}(w_t) - H(w_t)) y| \\leq \\epsilon \\cdot \\sqrt{x^T H(w_t) x} \\cdot \\sqrt{y^T H(w_t) y}, \\quad \\forall x, y \\in \\mathcal{K}. \\tag{C2}$$\n\nNote that (C1) and (C2) are two commonly seen guarantees for matrix approximation problems. In particular, (C2) is stronger in the sense that the spectrum of the approximated matrix $H(w_t)$ is well preserved. Below in Lemma 2, we shall see such a stronger condition ensures a better dependence on the condition number in terms of the convergence rate. For the second layer of approximation, we require the solver to produce an $\\epsilon_0$-approximate solution $w_{t+1}$ satisfying\n\n$$\\|w_{t+1} - w_{t+1}^*\\| \\leq \\epsilon_0 \\cdot \\|w_t - w_{t+1}^*\\|, \\tag{4}$$\n\nwhere $w_{t+1}^*$ is the exact optimal solution to (2). Note that (4) implies an $\\epsilon_0$-relative error approximation to the exact update direction, i.e., $\\|v - v^*\\| \\leq \\epsilon_0 \\|v^*\\|$ where $v = w_{t+1} - w_t$, $v^* = w_{t+1}^* - w_t$.\n\nLemma 2 (Structural Result). Let $\\epsilon \\in (0, 1/2)$ and $\\epsilon_0$ be given and $\\{w_t\\}_{t=1}^T$ be a sequence generated by (2) which satisfies (4). Also assume that the initial point $w_0$ satisfies $\\|w_0 - w^*\\| \\leq \\frac{\\mu}{4L}$. Under Assumptions A.1 and A.2, the solution error satisfies the following recursion\n\n$$\\|w_{t+1} - w^*\\| \\leq (1 + \\epsilon_0) C_q \\cdot \\|w_t - w^*\\|^2 + (\\epsilon_0 + (1 + \\epsilon_0) C_l) \\cdot \\|w_t - w^*\\|, \\tag{5}$$\n\nwhere $C_l$ and $C_q$ are specified as below.\n\u2022 $C_q = \\frac{2L}{(1 - 2\\epsilon\\kappa)\\mu}$ and $C_l = \\frac{4\\epsilon\\kappa}{1 - 2\\epsilon\\kappa}$, if condition (C1) is met;\n\u2022 $C_q = \\frac{2L}{(1 - \\epsilon)\\mu}$ and $C_l = \\frac{3\\epsilon\\sqrt{\\kappa}}{1 - \\epsilon}$, if condition (C2) is met.\n\n3.2 Complexities related to the choice of sampling scheme $\\mathcal{S}$\n\nThe following lemma gives the complexity of constructing the sampling distributions used in this paper. Here, we adopt the fast approximation algorithm for standard leverage scores [6] to obtain an efficient approximation to our block partial leverage scores.\n\nLemma 3 (Construction Complexity). 
Under Assumption A.3, it takes $t_{\\mathrm{const}} = O(\\mathrm{nnz}(A))$ time to construct a block norm squares sampling distribution, and it takes $t_{\\mathrm{const}} = O(\\mathrm{nnz}(A) \\log n)$ time to construct, with high probability, a distribution with constant factor approximation to the block partial leverage scores.\n\nThe following theorem indicates that if the blocks of the augmented matrix of $\\{A_i(w)\\}$ (see Assumption A.3) are sampled based on block norm squares or block partial leverage scores with large enough sampling size, (C1) or (C2) holds, respectively, with high probability.\n\nTheorem 4 (Sufficient Sample Size). Given any $\\epsilon \\in (0, 1)$, the following statements hold:\n(i) Let $r_i = \\|A_i\\|_F^2$, $i = 1, \\ldots, n$, set $p_i = r_i / (\\sum_{j=1}^n r_j)$ and construct $\\tilde{H}$ as in Steps 5-9 of Algorithm 1. Then if $s \\geq 4 \\mathrm{sr}(A) \\cdot \\log(\\min\\{4 \\mathrm{sr}(A), d\\}/\\delta)/\\epsilon^2$, with probability at least $1 - \\delta$, (C1) holds.\n(ii) Let $\\{\\hat{\\tau}_i^Q(A)\\}_{i=1}^n$ be some overestimate of the block partial leverage scores, i.e., $\\hat{\\tau}_i^Q(A) \\geq \\tau_i^Q(A)$, $i = 1, \\ldots, n$, and set $p_i = \\hat{\\tau}_i^Q(A) / (\\sum_{j=1}^n \\hat{\\tau}_j^Q(A))$, $i = 1, \\ldots, n$. Construct $\\tilde{H}$ as in Steps 5-9 of Algorithm 1. Then if $s \\geq 4 (\\sum_{i=1}^n \\hat{\\tau}_i^Q(A)) \\cdot \\log(4d/\\delta)/\\epsilon^2$, with probability at least $1 - \\delta$, (C2) holds.\n\nRemarks: Part (i) of Theorem 4 is an extension of [10] to our particular augmented matrix setting. Also, as for the exact block partial leverage scores we have $\\sum_{i=1}^n \\tau_i^Q(A) \\leq d$, part (ii) of Theorem 4 implies that, using exact scores, less than $O(d \\log d/\\epsilon^2)$ blocks are needed for (C2) to hold.\n\n3.3 Complexities related to the choice of solver $\\mathcal{A}$\n\nWe now discuss how $t_{\\mathrm{solve}}$ in (3) is affected by the choice of the solver $\\mathcal{A}$ in Algorithm 1. The approximate Hessian $\\tilde{H}(w_t)$ is of the form $\\tilde{A}^T \\tilde{A} + Q$ where $\\tilde{A} \\in \\mathbb{R}^{sk \\times d}$. As a result, the complexity for solving the sub-problem (2) essentially depends on the choice of $\\mathcal{A}$, the constraint set $\\mathcal{C}$, $s$ and $d$, i.e., $t_{\\mathrm{solve}} = T(\\mathcal{A}, \\mathcal{C}, s, d)$. For example, when the problem is unconstrained ($\\mathcal{C} = \\mathbb{R}^d$), CG takes $t_{\\mathrm{solve}} = O(sd\\sqrt{\\kappa_t} \\log(1/\\epsilon))$ to return a solution with approximation quality $\\epsilon_0 = \\sqrt{\\kappa_t} \\epsilon$ in (4) where $\\kappa_t = \\lambda_{\\max}(\\tilde{H}(w_t))/\\lambda_{\\min}(\\tilde{H}(w_t))$.\n\n3.4 Total complexity per iteration\n\nLemma 2 implies that, by choosing appropriate values for $\\epsilon$ and $\\epsilon_0$, SSN inherits a local constant linear convergence rate, i.e., $\\|w_{t+1} - w^*\\| \\leq \\rho \\|w_t - w^*\\|$ with $\\rho < 1$. The following corollary gives the total complexity per iteration of Algorithm 1 to obtain a locally linear rate.\n\nCorollary 5. Suppose $\\mathcal{C} = \\mathbb{R}^d$ and CG is used to solve the sub-problem (2). Then under Assumption A.3, to obtain a constant local linear convergence rate with a constant probability, the complexity per iteration of Algorithm 1 using the block partial leverage scores sampling and block norm squares sampling is $\\tilde{O}(\\mathrm{nnz}(A) \\log n + d^2 \\kappa^{3/2})$ and $\\tilde{O}(\\mathrm{nnz}(A) + \\mathrm{sr}(A) d \\kappa^{5/2})$, respectively.^2\n\n3.5 Comparison with existing similar methods\n\nAs discussed above, the sampling scheme $\\mathcal{S}$ plays a crucial role in the overall complexity of SSN. We first compare our proposed non-uniform sampling schemes with the uniform alternative [20], in terms of complexities $t_{\\mathrm{const}}$ and $t_{\\mathrm{solve}}$ as well as the quality of the locally linear-quadratic error recursion (5), measured by $C_q$ and $C_l$. Table 2 gives a summary of such comparison where, for simplicity, we assume that $k = 1$, $\\mathcal{C} = \\mathbb{R}^d$, and a direct solver is used for the linear system sub-problem (2). 
Also, throughout this subsection, for randomized algorithms, we choose parameters such that the failure probability is a constant. One advantage of uniform sampling is its simplicity of construction. However, as shown in Section 3.2, it takes nearly input-sparsity time to construct the proposed non-uniform sampling distribution. In addition, when rows of $A$ are very non-uniform, i.e., $\\max_i \\|A_i\\| \\approx \\|A\\|$, the uniform scheme requires $\\Omega(n)$ samples to achieve (C1). It can also be seen that for a given $\\epsilon$, row norm squares sampling requires the smallest sampling size, yielding the smallest $t_{\\mathrm{solve}}$ in Table 2. More importantly, although either (C1) or (C2) is sufficient to give (5), having (C2) as in SSN with leverage score sampling yields constants $C_q$ and $C_l$ with much better dependence on the local condition number, $\\kappa$, than other methods. This fact can drastically improve the performance of SSN for ill-conditioned problems; see Figure 1 in Section 4.\n\nTable 2: Comparison between standard Newton\u2019s method and sub-sampled Newton methods (SSN) with different sampling schemes. 
$C_q$ and $C_l$ are the constants appearing in (5), $A$ is the augmented matrix of $\\{A_i(w)\\}$ with stable rank $\\mathrm{sr}(A)$, $\\kappa = \\nu/\\mu$ is the local condition number and $\\tilde{\\kappa} = L/\\mu$. Here, we assume that $k = 1$, $\\mathcal{C} = \\mathbb{R}^d$, and a direct solver is used in Algorithm 1.\n\nNAME | $t_{\\mathrm{const}}$ | $t_{\\mathrm{solve}} = sd^2$ | $C_q$ | $C_l$\nNewton\u2019s method | $0$ | $O(nd^2)$ | $\\tilde{\\kappa}$ | $0$\nSSN (leverage scores) | $O(\\mathrm{nnz}(A) \\log n)$ | $\\tilde{O}((\\sum_i \\tau_i^Q(A)) d^2/\\epsilon^2)$ | $\\frac{\\tilde{\\kappa}}{1 - \\epsilon}$ | $\\frac{\\epsilon\\sqrt{\\kappa}}{1 - \\epsilon}$\nSSN (row norm squares) | $O(\\mathrm{nnz}(A))$ | $\\tilde{O}(\\mathrm{sr}(A) d^2/\\epsilon^2)$ | $\\frac{\\tilde{\\kappa}}{1 - \\epsilon\\kappa}$ | $\\frac{\\epsilon\\kappa}{1 - \\epsilon\\kappa}$\nSSN (uniform) [20] | $O(1)$ | $\\tilde{O}(nd^2 \\frac{\\max_i \\|A_i\\|^2}{\\|A\\|^2}/\\epsilon^2)$ | $\\frac{\\tilde{\\kappa}}{1 - \\epsilon\\kappa}$ | $\\frac{\\epsilon\\kappa}{1 - \\epsilon\\kappa}$\n\n^2 In this paper, $\\tilde{O}(\\cdot)$ hides logarithmic factors of $d$, $\\kappa$ and $1/\\delta$.\n\nNext, recall that in Table 1, we summarize the per-iteration complexity needed by our algorithm and other similar methods [20, 1, 18] to achieve a given local linear convergence rate. Here we provide more details. First, the definition of various notions of condition number used in Table 1 is given below. For any given $w \\in \\mathbb{R}^d$, define\n\n$$\\kappa(w) = \\frac{\\lambda_{\\max}(\\sum_{i=1}^n H_i(w))}{\\lambda_{\\min}(\\sum_{i=1}^n H_i(w))}, \\quad \\hat{\\kappa}(w) = n \\cdot \\frac{\\max_i \\lambda_{\\max}(H_i(w))}{\\lambda_{\\min}(\\sum_{i=1}^n H_i(w))}, \\quad \\bar{\\kappa}(w) = \\frac{\\max_i \\lambda_{\\max}(H_i(w))}{\\min_i \\lambda_{\\min}(H_i(w))}, \\tag{6}$$\n\nassuming that the denominators are non-zero. It is easy to see that $\\kappa(w) \\leq \\hat{\\kappa}(w) \\leq \\bar{\\kappa}(w)$. However, the degree of the discrepancy among these inequalities depends on the properties of $H_i(w)$. Roughly speaking, when all $H_i(w)$\u2019s are \u201csimilar\u201d, one has that $\\lambda_{\\max}^{\\mathcal{K}}(\\sum_{i=1}^n H_i(w)) \\approx \\sum_{i=1}^n \\lambda_{\\max}^{\\mathcal{K}}(H_i(w)) \\approx n \\cdot \\max_i \\lambda_{\\max}^{\\mathcal{K}}(H_i(w))$, and thus $\\kappa(w) \\approx \\hat{\\kappa}(w) \\approx \\bar{\\kappa}(w)$. However, in many real applications, such uniformity simply does not exist. For example, it is not hard to design a matrix $A$ with non-uniform rows such that for $H = A^T A$, $\\hat{\\kappa}$ and $\\bar{\\kappa}$ are larger than $\\kappa$ by a factor of $n$. This implies that although SSN with leverage score sampling has a quadratic dependence on $d$, its dependence on the condition number is significantly better than all other methods such as SSN (uniform) and LiSSA. Moreover, compared to Newton\u2019s method, all these stochastic variants replace the coefficient of the leading term, i.e., $O(nd)$, with some lower order terms that only depend on $d$ and condition numbers (assuming $\\mathrm{nnz}(A) \\approx nd$). Therefore, one should expect these algorithms to perform well when $n \\gg d$ and the problem is moderately conditioned.\n\n4 Numerical Experiments\n\nWe consider an estimation problem in GLMs with Gaussian prior. Assume $X \\in \\mathbb{R}^{n \\times d}$, $Y \\in \\mathcal{Y}^n$ are the data matrix and response vector. The problem of minimizing the negative log-likelihood with ridge penalty can be written as\n\n$$\\min_{w \\in \\mathbb{R}^d} \\sum_{i=1}^n \\psi(x_i^T w, y_i) + \\lambda \\|w\\|_2^2,$$\n\nwhere $\\psi: \\mathbb{R} \\times \\mathcal{Y} \\rightarrow \\mathbb{R}$ is a convex cumulant generating function and $\\lambda \\geq 0$ is the ridge penalty parameter. In this case, the Hessian is $H(w) = \\sum_{i=1}^n \\psi''(x_i^T w, y_i) x_i x_i^T + \\lambda I := X^T D^2(w) X + \\lambda I$, where $x_i$ is the $i$-th column of $X^T$ and $D(w)$ is a diagonal matrix with the diagonal $[D(w)]_{ii} = \\sqrt{\\psi''(x_i^T w, y_i)}$. The augmented matrix of $\\{A_i(w)\\}$ can be written as $A(w) = DX \\in \\mathbb{R}^{n \\times d}$ where $A_i(w) = [D(w)]_{ii} x_i^T$.\n\nFor our numerical simulations, we consider a very popular instance of GLMs, namely, logistic regression, where $\\psi(u, y) = \\log(1 + \\exp(-uy))$ and $\\mathcal{Y} = \\{\\pm 1\\}$. Table 3 summarizes the datasets used in our experiments.\n\nTable 3: Datasets used in ridge logistic regression. In the above, $\\kappa$ and $\\bar{\\kappa}$ are the local condition numbers of ridge logistic regression problem with $\\lambda = 0.01$ as defined in (6).\n\nDATASET | CT slices [9] | Forest [2] | Adult [13] | Buzz [11]\n$n$ | 53,500 | 581,012 | 32,561 | 59,535\n$d$ | 385 | 55 | 123 | 78\n$\\kappa$ | 368 | 221 | 182 | 37\n$\\hat{\\kappa}$ | 47,078 | 322,370 | 69,359 | 384,580\n\nWe compare the performance of the following five algorithms: (i) Newton: the standard Newton\u2019s method, (ii) Uniform: SSN with uniform sampling, (iii) PLevSS: SSN with partial leverage scores sampling, (iv) RNormSS: SSN with block (row) norm squares sampling, and (v) LBFGS-k: the standard L-BFGS method [14] with history size $k$.\n\nAll algorithms are initialized with a zero vector.^3 We also use CG to solve the sub-problem approximately to within $10^{-6}$ relative residual error. In order to compute the relative error $\\|w_t - w^*\\|/\\|w^*\\|$, an estimate of $w^*$ is obtained by running the standard Newton\u2019s method for a sufficiently long time. Note here, in SSN with partial leverage score sampling, we recompute the leverage scores every 10 iterations. Roughly speaking, these \u201cstale\u201d leverage scores can be viewed as approximate leverage scores for the current iteration with approximation quality that can be upper bounded by the change of the Hessian, and such quantity is often small in practice. So reusing the leverage scores allows us to further drive down the running time.\n\nWe first investigate the effect of the condition number, controlled by varying $\\lambda$, on the performance of different methods, and the results are depicted in Figure 1. It can be seen that in well-conditioned cases, all sampling schemes work equally well. 
However, as the condition number worsens, the\nperformance of uniform sampling deteriorates, while non-uniform sampling, in particular leverage\nscore sampling, shows a great degree of robustness to such ill-conditioning. The experiments\nshown in Figure 1 are consistent with the theoretical results of Table 2, showing that the theory\npresented here can indeed be a reliable guide to practice.\n\n\u00b3 Theoretically, the suitable initial point for all the algorithms is the one with which the standard Newton\u2019s\nmethod converges with a unit stepsize. Here, w0 = 0 happens to be one such good starting point.\n\n(a) condition number  (b) sampling size  (c) running time\n\nFigure 1: Ridge logistic regression on Adult with different \u03bb\u2019s: (a) local condition number \u03ba, (b)\nsample size for different SSN methods giving the best overall running time, (c) running time for\ndifferent methods to achieve 10^{-8} relative error.\n\nNext, we compare the performance of various methods as measured by relative error of the solution\nvs. running time; the results are shown in Figure 2.\u2074 It can be seen that, in most cases, SSN with\nnon-uniform sampling schemes outperforms the other algorithms, especially Newton\u2019s method. In\nparticular, the uniform sampling scheme performs poorly, e.g., in Figure 2(b), when the problem exhibits\na high non-uniformity among data points, which is reflected in the difference between \u03ba and \u00af\u03ba shown\nin Table 3.\n\n(a) CT Slice  (b) Forest  (c) Adult  (d) Buzz\n\nFigure 2: Iterate relative solution error vs. time (s) for various methods on four datasets with ridge\npenalty parameter \u03bb = 0.01. 
The values in brackets denote the sample size used for each method.\n\nWe would like to remind the reader that for the locally strongly convex problems that we consider\nhere, one can provably show that the behavior of the error in the loss function, i.e., (F(wk) \u2212\nF(w\u2217))/|F(w\u2217)|, follows the same pattern as that of the solution error, i.e., \u2016wk \u2212 w\u2217\u2016/\u2016w\u2217\u2016; see\n[23] for details. As a result, our algorithms remain effective for cases where the primary goal is\nto reduce the loss (as opposed to the solution error).\n5 Conclusions\nIn this paper, we propose non-uniformly sub-sampled Newton methods with inexact update for a class\nof constrained problems. We show that our algorithms have a better dependence on the condition\nnumber and enjoy a lower per-iteration complexity, compared to other similar existing methods.\nTheoretical advantages are numerically demonstrated.\nAcknowledgments. We would like to thank the Army Research Office and the Defense Advanced\nResearch Projects Agency as well as Intel, Toshiba and the Moore Foundation for support along\nwith DARPA through MEMEX (FA8750-14-2-0240), SIMPLEX (N66001-15-C-4043), and XDATA\n(FA8750-12-2-0335) programs, and the Office of Naval Research (N000141410102, N000141210041\nand N000141310129). Any opinions, findings, and conclusions or recommendations expressed in\nthis material are those of the authors and do not necessarily reflect the views of DARPA, ONR, or the\nU.S. government.\n\nReferences\n[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization in linear time. 
arXiv preprint arXiv:1602.03943, 2016.\n\n\u2074 For each sub-sampled Newton method, the sampling size is determined by choosing the best value from\n{10d, 20d, 30d, ..., 100d, 200d, 300d, ..., 1000d} in the sense that the objective value drops to 1/3 of the initial\nfunction value first.\n\n[Figure 1 plots: x-axis log(lambda); y-axes (a) condition number, (b) best sampling size, (c) running time (s). Figure 2 plots: x-axis time (s); y-axis ||w - w*||2/||w*||2; logistic, lambda=0.01. Legend sample sizes in Figure 2: (a) Uniform (7700), PLevSS (3850), RNormSS (3850); (b) Uniform (27500), PLevSS (3300), RNormSS (3300); (c) Uniform (24600), PLevSS (2460), RNormSS (2460); (d) Uniform (39000), PLevSS (1560), RNormSS (1560); each panel also shows Newton, LBFGS-100, and LBFGS-50.]\n\n[2] Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture, 24(3):131\u2013151, 1999.\n\n[3] S\u00e9bastien Bubeck. Theory of convex optimization for machine learning. arXiv preprint arXiv:1405.4980, 2014.\n\n[4] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977\u2013995, 2011.\n\n[5] Ron S Dembo, Stanley C Eisenstat, and Trond Steihaug. Inexact Newton methods. SIAM Journal on Numerical Analysis, 19(2):400\u2013408, 1982.\n\n[6] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. 
The Journal of Machine Learning Research, 13(1):3475\u20133506, 2012.\n\n[7] Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, pages 3034\u20133042, 2015.\n\n[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.\n\n[9] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian P\u00f6lsterl, and Alexander Cavallaro. 2d image registration in ct images using radial image descriptors. In Medical Image Computing and Computer-Assisted Intervention\u2013MICCAI 2011, pages 607\u2013614. Springer, 2011.\n\n[10] John T. Holodnak and Ilse C. F. Ipsen. Randomized approximation of the Gram matrix: Exact computation and probabilistic bounds. SIAM J. Matrix Analysis Applications, 36(1):110\u2013137, 2015.\n\n[11] Fran\u00e7ois Kawala, Ahlame Douzal-Chouakria, Eric Gaussier, and Eustache Dimert. Pr\u00e9dictions d\u2019activit\u00e9 dans les r\u00e9seaux sociaux en ligne. In 4i\u00e8me conf\u00e9rence sur les mod\u00e8les et l\u2019analyse des r\u00e9seaux: Approches math\u00e9matiques et informatiques, page 16, 2013.\n\n[12] Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287\u2013364, 2012.\n\n[13] M. Lichman. UCI machine learning repository, 2013.\n\n[14] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. 45:503\u2013528, 1989.\n\n[15] Michael W Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at arXiv:1104.5557v2.\n\n[16] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735\u2013742, 2010.\n\n[17] Jorge Nocedal and Stephen Wright. Numerical optimization. 
Springer Science & Business Media, 2006.\n\n[18] Mert Pilanci and Martin J Wainwright. Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. arXiv preprint arXiv:1505.02250, 2015.\n\n[19] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-Sampled Newton Methods I: Globally Convergent Algorithms. arXiv preprint arXiv:1601.04737, 2016.\n\n[20] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-Sampled Newton Methods II: Local Convergence Rates. arXiv preprint arXiv:1601.04738, 2016.\n\n[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267\u2013288, 1996.\n\n[22] Oriol Vinyals and Daniel Povey. Krylov subspace descent for deep learning. arXiv preprint arXiv:1111.4259, 2011.\n\n[23] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher R\u00e9, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. arXiv preprint arXiv:1607.00559, 2016.\n", "award": [], "sourceid": 1497, "authors": [{"given_name": "Peng", "family_name": "Xu", "institution": "Stanford University"}, {"given_name": "Jiyan", "family_name": "Yang", "institution": "Stanford University"}, {"given_name": "Fred", "family_name": "Roosta", "institution": "University of California Berkeley"}, {"given_name": "Christopher", "family_name": "R\u00e9", "institution": null}, {"given_name": "Michael", "family_name": "Mahoney", "institution": "UC Berkeley"}]}