{"title": "First-Order Adaptive Sample Size Methods to Reduce Complexity of Empirical Risk Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2060, "page_last": 2068, "abstract": "This paper studies empirical risk minimization (ERM) problems for large-scale datasets and incorporates the idea of adaptive sample size methods to improve the guaranteed convergence bounds for first-order stochastic and deterministic methods. In contrast to traditional methods that attempt to solve the ERM problem corresponding to the full dataset directly, adaptive sample size schemes start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically -- e.g., scaling by a factor of two -- and use the solution of the previous ERM as a warm start for the new ERM. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. The gains are specific to the choice of method. When particularized to, e.g., accelerated gradient descent and stochastic variance reduce gradient, the computational cost advantage is a logarithm of the number of training samples. 
Numerical experiments on various datasets confirm the theoretical claims and showcase the gains of using the proposed adaptive sample size scheme.", "full_text": "First-Order Adaptive Sample Size Methods to Reduce Complexity of Empirical Risk Minimization

Aryan Mokhtari, University of Pennsylvania, aryanm@seas.upenn.edu
Alejandro Ribeiro, University of Pennsylvania, aribeiro@seas.upenn.edu

Abstract

This paper studies empirical risk minimization (ERM) problems for large-scale datasets and incorporates the idea of adaptive sample size methods to improve the guaranteed convergence bounds for first-order stochastic and deterministic methods. In contrast to traditional methods that attempt to solve the ERM problem corresponding to the full dataset directly, adaptive sample size schemes start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically -- e.g., scaled by a factor of two -- and the solution of the previous ERM problem is used as a warm start for the new ERM problem. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. The gains are specific to the choice of method. When particularized to, e.g., accelerated gradient descent and stochastic variance reduced gradient, the computational cost advantage is a logarithm of the number of training samples. Numerical experiments on various datasets confirm the theoretical claims and showcase the gains of using the proposed adaptive sample size scheme.

1 Introduction

Finite sum minimization (FSM) problems involve objectives that are expressed as the sum of a typically large number of component functions. 
Since evaluating descent directions is costly, it is customary to utilize stochastic descent methods that access only one of the functions at each iteration. When considering first-order methods, a fitting measure of complexity is the total number of gradient evaluations that are needed to achieve optimality of order ε. The paradigmatic deterministic gradient descent (GD) method serves as a naive complexity upper bound and has long been known to obtain an ε-suboptimal solution with O(Nκ log(1/ε)) gradient evaluations for an FSM problem with N component functions and condition number κ [13]. Accelerated gradient descent (AGD) [14] improves the computational complexity of GD to O(N√κ log(1/ε)), which is known to be the optimal bound for deterministic first-order methods [13]. In terms of stochastic optimization, it has been only recently that linearly convergent methods have been proposed. Stochastic averaging gradient [15, 8], stochastic variance reduction [10], and stochastic dual coordinate ascent [17, 18] have all been shown to converge to ε-accuracy at a cost of O((N + κ) log(1/ε)) gradient evaluations. The accelerating catalyst framework in [11] further reduces complexity to O((N + √(Nκ)) log(κ) log(1/ε)), and the works in [1] and [7] to O((N + √(Nκ)) log(1/ε)). The latter matches the upper bound on the complexity of stochastic methods [20].

Perhaps the main motivation for studying FSM is the solution of empirical risk minimization (ERM) problems associated with a large training set. ERM problems are particular cases of FSM, but they do have two specific qualities that come from the fact that ERM is a proxy for statistical loss minimization. 
The first property is that since the empirical risk and the statistical loss have different minimizers, there is no reason to solve ERM beyond the expected difference between the two objectives. This so-called statistical accuracy takes the place of ε in the complexity orders of the previous paragraph and is a constant of order O(1/N^α), where α is a constant from the interval [0.5, 1] depending on the regularity of the loss function; see Section 2. The second important property of ERM is that the component functions are drawn from a common distribution. This implies that if we consider subsets of the training set, the respective empirical risk functions are not that different from each other and, indeed, their differences are related to the statistical accuracy of the subset.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The relationship of ERM to statistical loss minimization suggests that ERM problems have more structure than FSM problems. This is not exploited by most existing methods which, albeit used for ERM, are in fact designed for FSM. The goal of this paper is to exploit the relationship between ERM and statistical loss minimization to achieve lower overall computational complexity for a broad class of first-order methods applied to ERM. The technique we propose uses subsamples of the training set containing n ≤ N component functions that we grow geometrically. In particular, we start with a small number of samples and minimize the corresponding empirical risk, augmented by a regularization term of order V_n, up to its statistical accuracy. Note that, based on the first property of ERM, the added adaptive regularization term does not modify the required accuracy while it makes the problem strongly convex and improves the problem condition number. 
After solving the subproblem, we double the size of the training set and use the solution of the problem with n samples as a warm start for the problem with 2n samples. This is a reasonable initialization since, based on the second property of ERM, the functions are drawn from a common distribution and, therefore, the optimal values of the ERM problems with n and 2n functions are not that different from each other. The proposed approach succeeds in exploiting the two properties of ERM problems to improve complexity bounds of first-order methods. In particular, we show that to reach the statistical accuracy of the full training set the adaptive sample size scheme reduces the overall computational complexity of a broad range of first-order methods by a factor of log(N^α). For instance, the overall computational complexity of adaptive sample size AGD to reach the statistical accuracy of the full training set is of order O(N√κ), which is lower than the O(N√κ log(N^α)) complexity of AGD.

Related work. The adaptive sample size approach was used in [6] to improve the performance of the SAGA method [8] for solving ERM problems. In the dynamic SAGA (DynaSAGA) method in [6], the size of the training set grows at each iteration by adding two new samples, and the iterates are updated by a single step of SAGA. Although DynaSAGA succeeds in improving the performance of SAGA for solving ERM problems, it does not use an adaptive regularization term to tune the problem condition number. Moreover, DynaSAGA only works for strongly convex functions, while in our proposed scheme the functions are convex (not necessarily strongly convex). The work in [12] is the most similar to this manuscript. The Ada Newton method introduced in [12] aims to solve each subproblem within its statistical accuracy with a single update of Newton's method by ensuring that the iterates always stay in the quadratic convergence region of Newton's method. 
Ada Newton reaches the statistical accuracy of the full training set in almost two passes over the dataset; however, its computational complexity is prohibitive since it requires computing the objective function Hessian and its inverse at each iteration.

2 Problem Formulation

Consider a decision vector w ∈ R^p, a random variable Z with realizations z, and a convex loss function f(w; z). We aim to find the optimal argument that minimizes the optimization problem

w* := argmin_w L(w) = argmin_w E_Z[f(w, Z)] = argmin_w ∫_Z f(w, z) P(dz),   (1)

where L(w) := E_Z[f(w, Z)] is defined as the expected loss, and P is the probability distribution of the random variable Z. The optimization problem in (1) cannot be solved since the distribution P is unknown. However, we have access to a training set T = {z_1, ..., z_N} containing N independent samples z_1, ..., z_N drawn from P, and, therefore, we attempt to minimize the empirical loss associated with the training set T, which is equivalent to solving the problem

w†_n := argmin_w L_n(w) = argmin_w (1/n) Σ_{i=1}^n f(w, z_i),   (2)

for n = N. Note that in (2) we defined L_n(w) := (1/n) Σ_{i=1}^n f(w, z_i) as the empirical loss.

There is a rich literature on bounds for the difference between the expected loss L and the empirical loss L_n, which is also referred to as the estimation error [4, 3]. We assume here that there exists a constant V_n, which depends on the number of samples n, that upper bounds the difference between the expected and empirical losses for all w ∈ R^p,

E[ sup_{w ∈ R^p} |L(w) − L_n(w)| ] ≤ V_n,   (3)

where the expectation is with respect to the choice of the training set. The celebrated work of Vapnik in [19, Section 3.4] provides the upper bound V_n = O(√((1/n) log(1/n))), which can be improved to V_n = O(√(1/n)) using the chaining technique (see, e.g., [5]). 
Bounds of the order Vn = O(1/n)\n\nhave been derived more recently under stronger regularity conditions that are not uncommon in\npractice, [2, 9, 4]. In this paper, we report our results using the general bound Vn = O(1/n\u21b5) where\n\u21b5 can be any constant form the interval [0.5, 1].\nThe observation that the optimal values of the expected loss and empirical loss are within a Vn\ndistance of each other implies that there is no gain in improving the optimization error of minimizing\nLn beyond the constant Vn. In other words, if we \ufb01nd an approximate solution wn such that the\noptimization error is bounded by Ln(wn) Ln(w\u2020n) \uf8ff Vn, then \ufb01nding a more accurate solution to\nreduce the optimization error is not bene\ufb01cial since the overall error, i.e., the sum of estimation and\noptimization errors, does not become smaller than Vn. Throughout the paper we say that wn solves\nthe ERM problem in (2) to within its statistical accuracy if it satis\ufb01es Ln(wn) Ln(w\u2020n) \uf8ff Vn.\nWe can further leverage the estimation error to add a regularization term of the form (cVn/2)kwk2 to\nthe empirical loss to ensure that the problem is strongly convex. To do so, we de\ufb01ne the regularized\nempirical risk Rn(w) := Ln(w) + (cVn/2)kwk2 and the corresponding optimal argument\n\nw\u21e4n := argmin\n\nw\n\nRn(w) = argmin\n\nw\n\nLn(w) +\n\ncVn\n2 kwk2,\n\n(4)\n\nand attempt to minimize Rn with accuracy Vn. Since the regularization in (4) is of order Vn and\n(3) holds, the difference between Rn(w\u21e4n) and L(w\u21e4) is also of order Vn \u2013 this is not immediate\nas it seems; see [16]. Thus, the variable wn solves the ERM problem in (2) to within its statistical\naccuracy if it satis\ufb01es Rn(wn) Rn(w\u21e4n) \uf8ff Vn. It follows that by solving the problem in (4) for\nn = N we \ufb01nd w\u21e4N that solves the expected risk minimization in (1) up to the statistical accuracy\nVN of the full training set T . 
In the following section we introduce a class of methods that solve problem (4) up to its statistical accuracy faster than traditional deterministic and stochastic descent methods.

3 Adaptive Sample Size Methods

The empirical risk minimization (ERM) problem in (4) can be solved using state-of-the-art methods for minimizing strongly convex functions. However, these methods never exploit the particular property of ERM that the functions are drawn from the same distribution. In this section, we propose an adaptive sample size scheme which exploits this property of ERM to improve the convergence guarantees of traditional optimization methods to reach the statistical accuracy of the full training set. In the proposed adaptive sample size scheme, we start with a small number of samples and solve the corresponding ERM problem with a specific accuracy. Then, we double the size of the training set and use the solution of the previous ERM problem -- with half the samples -- as a warm start for the new ERM problem. This procedure continues until the training set becomes identical to the given training set T, which contains N samples.

Consider the training set S_m with m samples as a subset of the full training set T, i.e., S_m ⊂ T. Assume that we have solved the ERM problem corresponding to the set S_m such that the approximate solution w_m satisfies the condition E[R_m(w_m) − R_m(w*_m)] ≤ δ_m. Now the next step in the proposed adaptive sample size scheme is to double the size of the current training set S_m and solve the ERM problem corresponding to the set S_n, which has n = 2m samples and contains the previous set, i.e., S_m ⊂ S_n ⊂ T.

We use w_m, which is a proper approximation of the optimal solution of R_m, as the initial iterate for the optimization method that we use to minimize the risk R_n. 
This is a reasonable choice if the optimal arguments of R_m and R_n are close to each other, which is the case since the samples are drawn from a fixed distribution P. Starting with w_m, we can use first-order descent methods to minimize the empirical risk R_n. Depending on the iterative method that we use for solving each ERM problem, we might need a different number of iterations to find an approximate solution w_n which satisfies the condition E[R_n(w_n) − R_n(w*_n)] ≤ δ_n. To design a comprehensive routine we need to come up with a proper condition for the required accuracy δ_n at each phase.

Algorithm 1 Adaptive Sample Size Mechanism
1: Input: Initial sample size n = m_0 and argument w_n = w_{m_0} with ‖∇R_n(w_n)‖ ≤ (√(2c)) V_n
2: while n ≤ N do {main loop}
3:   Update argument and index: w_m = w_n and m = n.
4:   Increase sample size: n = min{2m, N}.
5:   Set the initial variable: w̃ = w_m.
6:   while ‖∇R_n(w̃)‖ > (√(2c)) V_n do
7:     Update the variable w̃: Compute w̃ = Update(w̃, ∇R_n(w̃)).
8:   end while
9:   Set w_n = w̃.
10: end while

In the following proposition we derive an upper bound for the expected suboptimality of the variable w_m for the risk R_n based on the accuracy of w_m for the previous risk R_m associated with the training set S_m. This upper bound allows us to choose the accuracy δ_m efficiently.

Proposition 1. Consider the sets S_m and S_n as subsets of the training set T such that S_m ⊂ S_n ⊂ T, where the numbers of samples in the sets S_m and S_n are m and n, respectively. Further, define w_m as a δ_m-optimal solution of the risk R_m in expectation, i.e., E[R_m(w_m) − R*_m] ≤ δ_m, and recall V_n as the statistical accuracy of the training set S_n. Then the empirical risk error R_n(w_m) − R_n(w*_n) of the variable w_m corresponding to the set S_n in expectation is bounded above by

E[R_n(w_m) − R_n(w*_n)] ≤ δ_m + (2(n − m)/n)(V_{n−m} + V_m) + 2(V_m − V_n) + (c(V_m − V_n)/2)‖w*‖².   (5)

Proof. See Section 7.1 in the supplementary material.

The result in Proposition 1 characterizes the suboptimality of the variable w_m, which is a δ_m-suboptimal solution for the risk R_m, with respect to the empirical risk R_n associated with the set S_n. If we assume that the statistical accuracy V_n is of the order O(1/n^α) and we double the size of the training set at each step, i.e., n = 2m, then the inequality in (5) can be simplified to

E[R_n(w_m) − R_n(w*_n)] ≤ δ_m + [2 + (1 − 1/2^α)(2 + (c/2)‖w*‖²)] V_m.   (6)

The expression in (6) formalizes the reason that there is no need to solve the subproblem R_m beyond its statistical accuracy V_m. In other words, even if δ_m is zero the expected suboptimality will be of the order O(V_m), i.e., E[R_n(w_m) − R_n(w*_n)] = O(V_m). Based on this observation, the required precision δ_m for solving the subproblem R_m should be of the order δ_m = O(V_m).

The steps of the proposed adaptive sample size scheme are summarized in Algorithm 1. Note that since computation of the suboptimality R_n(w_n) − R_n(w*_n) requires access to the minimizer w*_n, we replace the condition R_n(w_n) − R_n(w*_n) ≤ V_n by a bound on the norm of the gradient ‖∇R_n(w_n)‖². The risk R_n is strongly convex, and we can bound the suboptimality R_n(w_n) − R_n(w*_n) as

R_n(w_n) − R_n(w*_n) ≤ (1/(2cV_n)) ‖∇R_n(w_n)‖².   (7)

Hence, at each stage, we stop updating the variable if the condition ‖∇R_n(w_n)‖ ≤ (√(2c)) V_n holds, which implies R_n(w_n) − R_n(w*_n) ≤ V_n. The intermediate variable w̃ can be updated in Step 7 using any first-order method. 
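Algorithm 1 can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: the generic `update` step, the choice V_n = n^(-α), and the assumption that the input w_{m_0} already satisfies the entry criterion are all simplifications:

```python
import numpy as np

def adaptive_sample_size(grad_Rn, update, w0, m0, N, c=1.0, alpha=0.5):
    """Adaptive sample size mechanism (Algorithm 1 sketch).

    grad_Rn(w, n) -> gradient of the regularized risk R_n at w,
    update(w, g, n) -> one step of any linearly convergent first-order method,
    w0 -> warm start for the first stage of size m0 (assumed accurate enough).
    """
    n, w = m0, w0
    while n < N:
        n = min(2 * n, N)                  # double the sample size (Step 4)
        V_n = n ** (-alpha)                # statistical accuracy V_n = O(1/n^alpha)
        # warm start: keep w from the previous stage (Step 5), then iterate
        # until the gradient-norm stopping rule of Steps 6-8 is met
        while np.linalg.norm(grad_Rn(w, n)) > np.sqrt(2 * c) * V_n:
            w = update(w, grad_Rn(w, n), n)
    return w
```

The outer loop runs roughly log2(N/m_0) times, and each inner loop only needs to drive the gradient norm below the threshold √(2c) V_n, which by (7) certifies the statistical accuracy of the current subproblem.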
We will discuss this procedure for the accelerated gradient descent (AGD) and stochastic variance reduced gradient (SVRG) methods in Sections 4.1 and 4.2, respectively.

4 Complexity Analysis

In this section, we aim to characterize the number of required iterations s_n at each stage to solve the subproblems within their statistical accuracy. We derive this result for all linearly convergent first-order deterministic and stochastic methods.

The inequality in (6) not only leads to an efficient policy for the required precision δ_m at each step, but also provides an upper bound for the suboptimality of the initial iterate, i.e., w_m, for minimizing the risk R_n. Using this upper bound, depending on the iterative method of choice, we can characterize the number of required iterations s_n to ensure that the updated variable is within the statistical accuracy of the risk R_n. To formally characterize the number of required iterations s_n, we first assume the following conditions are satisfied.

Assumption 1. The loss functions f(w, z) are convex with respect to w for all values of z. Moreover, their gradients ∇f(w, z) are Lipschitz continuous with constant M,

‖∇f(w, z) − ∇f(w′, z)‖ ≤ M ‖w − w′‖,   for all z.   (8)

The conditions in Assumption 1 imply that the expected loss L(w) and the empirical loss L_n(w) are convex and their gradients are Lipschitz continuous with constant M. Thus, the empirical risk R_n(w) is strongly convex with constant cV_n and its gradients ∇R_n(w) are Lipschitz continuous with parameter M + cV_n.

So far we have concluded that each subproblem should be solved up to its statistical accuracy. This observation leads to an upper bound for the number of iterations needed at each step to solve each subproblem. 
Indeed, various descent methods can be executed for solving the subproblem. Here we intend to come up with a general result that covers all descent methods that have a linear convergence rate when the objective function is strongly convex and smooth. In the following theorem, we derive a lower bound for the number of required iterations s_n to ensure that the variable w_n, which is the outcome of updating w_m by s_n iterations of the method of interest, is within the statistical accuracy of the risk R_n for any linearly convergent method.

Theorem 2. Consider the variable w_m as a V_m-suboptimal solution of the risk R_m in expectation, i.e., E[R_m(w_m) − R_m(w*_m)] ≤ V_m, where V_m = O(1/m^α). Consider the sets S_m ⊂ S_n ⊂ T such that n = 2m, and suppose Assumption 1 holds. Further, define 0 ≤ ρ_n < 1 as the linear convergence factor of the descent method used for updating the iterates. Then, the variable w_n generated based on the adaptive sample size mechanism satisfies E[R_n(w_n) − R_n(w*_n)] ≤ V_n if the number of iterations s_n at the n-th stage is larger than

s_n ≥ log[3 × 2^α + (2^α − 1)(2 + (c/2)‖w*‖²)] / (− log ρ_n).   (9)

Proof. See Section 7.2 in the supplementary material.

The result in Theorem 2 characterizes the number of required iterations at each phase. Depending on the linear convergence factor ρ_n and the parameter α for the order of statistical accuracy, the number of required iterations might be different. Note that the parameter ρ_n might depend on the size of the training set directly or through the dependency of the problem condition number on n. 
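The per-stage bound in (9) is easy to evaluate numerically. The following sketch computes the smallest integer iteration count satisfying it; the sample values of α, c, ‖w*‖², and ρ_n in the usage below are purely illustrative assumptions:

```python
import math

def iterations_per_stage(alpha, c, w_star_norm_sq, rho_n):
    """Smallest integer s_n satisfying the lower bound (9):
    s_n >= log[3 * 2^a + (2^a - 1)(2 + (c/2)||w*||^2)] / (-log rho_n)."""
    numerator = math.log(3 * 2 ** alpha
                         + (2 ** alpha - 1) * (2 + 0.5 * c * w_star_norm_sq))
    return math.floor(numerator / (-math.log(rho_n))) + 1
```

For instance, with α = 0.5, c = 1, ‖w*‖² = 1, and ρ_n = 0.9, the bound asks for 16 iterations per stage; a slower method with ρ_n = 0.99 needs roughly ten times more, reflecting the 1/(−log ρ_n) dependence.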
It is worth mentioning that the result in (9) is a lower bound on the number of required iterations, which means that s_n = ⌊log[3 × 2^α + (2^α − 1)(2 + (c/2)‖w*‖²)] / (− log ρ_n)⌋ + 1 is the exact number of iterations needed when minimizing R_n, where ⌊a⌋ indicates the floor of a. To characterize the overall computational complexity of the proposed adaptive sample size scheme, the exact expression for the linear convergence constant ρ_n is required. In the following section, we focus on two deterministic and stochastic methods and characterize their overall computational complexity to reach the statistical accuracy of the full training set T.

4.1 Adaptive Sample Size Accelerated Gradient (Ada AGD)

The accelerated gradient descent (AGD) method, also known as Nesterov's method, is a long-established descent method which achieves the optimal convergence rate for first-order deterministic methods. In this section, we aim to combine the update of AGD with the adaptive sample size scheme in Section 3 to improve the convergence guarantees of AGD for solving ERM problems. This can be done by using AGD for updating the iterates in Step 7 of Algorithm 1. Given an iterate w_m within the statistical accuracy of the set S_m, the adaptive sample size accelerated gradient descent method (Ada AGD) requires s_n iterations of AGD to ensure that the resulting iterate w_n lies within the statistical accuracy of S_n. In particular, if we initialize the sequences w̃ and ỹ as w̃_0 = ỹ_0 = w_m, the approximate solution w_n for the risk R_n is the outcome of the updates

w̃_{k+1} = ỹ_k − η_n ∇R_n(ỹ_k),   (10)

and

ỹ_{k+1} = w̃_{k+1} + β_n (w̃_{k+1} − w̃_k),   (11)

after s_n iterations, i.e., w_n = w̃_{s_n}. The parameters η_n and β_n are indexed by n since they depend on the number of samples. 
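One stage of Ada AGD -- the updates (10)-(11) run until the gradient-norm stopping rule of Algorithm 1 fires -- can be sketched as follows, with η_n and β_n set as in (12) of Theorem 3 below and V_n = n^(-α) as an illustrative assumption:

```python
import numpy as np

def ada_agd_stage(grad_Rn, w_m, n, M, c=1.0, alpha=0.5):
    """One Ada AGD stage on R_n, warm-started at w_m (updates (10)-(11))."""
    V_n = n ** (-alpha)
    eta = 1.0 / (c * V_n + M)                                   # step size, as in (12)
    beta = ((np.sqrt(c * V_n + M) - np.sqrt(c * V_n)) /
            (np.sqrt(c * V_n + M) + np.sqrt(c * V_n)))          # momentum, as in (12)
    w, y = w_m.copy(), w_m.copy()
    while np.linalg.norm(grad_Rn(w, n)) > np.sqrt(2 * c) * V_n:
        w_next = y - eta * grad_Rn(y, n)                        # update (10)
        y = w_next + beta * (w_next - w)                        # update (11)
        w = w_next
    return w
```

Note the momentum coefficient uses the strong convexity constant cV_n of the regularized risk, so the warm-started stage contracts at the accelerated rate governed by the condition number (M + cV_n)/(cV_n).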
We use the convergence rate of AGD to characterize the number of required iterations s_n to guarantee that the outcome of the recursive updates in (10) and (11) is within the statistical accuracy of R_n.

Theorem 3. Consider the variable w_m as a V_m-optimal solution of the risk R_m in expectation, i.e., E[R_m(w_m) − R_m(w*_m)] ≤ V_m, where V_m = O(1/m^α). Consider the sets S_m ⊂ S_n ⊂ T such that n = 2m, and suppose Assumption 1 holds. Further, set the parameters η_n and β_n as

η_n = 1/(cV_n + M)   and   β_n = (√(cV_n + M) − √(cV_n)) / (√(cV_n + M) + √(cV_n)).   (12)

Then, the variable w_n generated based on the updates of Ada AGD in (10)-(11) satisfies E[R_n(w_n) − R_n(w*_n)] ≤ V_n if the number of iterations s_n is larger than

s_n ≥ √((n^α M + c)/c) · log[6 × 2^α + (2^α − 1)(4 + c‖w*‖²)].   (13)

Moreover, if we define m_0 as the size of the first training set, to reach the statistical accuracy V_N of the full training set T the overall computational complexity of Ada AGD is given by

N [1 + log_2(N/m_0) + (√(2^α)/(√(2^α) − 1)) √(N^α M/c)] · log[6 × 2^α + (2^α − 1)(4 + c‖w*‖²)].   (14)

Proof. See Section 7.3 in the supplementary material.

The result in Theorem 3 characterizes the number of required iterations s_n to achieve the statistical accuracy of R_n. Moreover, it shows that to reach the accuracy V_N = O(1/N^α) for the risk R_N associated with the full training set T, the total computational complexity of Ada AGD is of the order O(N^(1+α/2)). 
Indeed, this complexity is lower than the overall computational complexity of AGD for reaching the same target, which is given by O(N√(κ_N) log(N^α)) = O(N^(1+α/2) log(N^α)). Note that this bound holds for AGD since the condition number κ_N := (M + cV_N)/(cV_N) of the risk R_N is of the order O(1/V_N) = O(N^α).

4.2 Adaptive Sample Size SVRG (Ada SVRG)

For the adaptive sample size mechanism presented in Section 3, we can also use linearly convergent stochastic methods such as the stochastic variance reduced gradient (SVRG) method in [10] to update the iterates. The SVRG method succeeds in reducing the computational complexity of deterministic first-order methods by computing a single gradient per iteration and using a delayed version of the average gradient to update the iterates. Indeed, we can exploit the idea of SVRG to develop low computational complexity adaptive sample size methods to improve the performance of deterministic adaptive sample size algorithms. Moreover, the adaptive sample size variant of SVRG (Ada SVRG) enhances the proven bounds for SVRG to solve ERM problems.

We proceed to extend the idea of the adaptive sample size scheme to the SVRG algorithm. To do so, consider w_m as an iterate within the statistical accuracy, E[R_m(w_m) − R_m(w*_m)] ≤ V_m, for a set S_m which contains m samples. Consider s_n and q_n as the numbers of outer and inner loops for the update of SVRG, respectively, when the size of the training set is n. Further, consider w̃ and ŵ as the sequences of iterates for the outer and inner loops of SVRG, respectively. In the adaptive sample size SVRG (Ada SVRG) method to minimize the risk R_n, we set the approximate solution w_m for the previous ERM problem as the initial iterate for the outer loop, i.e., w̃_0 = w_m. Then, the outer loop update, which contains the full gradient computation, is defined as

∇R_n(w̃_k) = (1/n) Σ_{i=1}^n ∇f(w̃_k, z_i) + cV_n w̃_k,   for k = 0, ..., s_n − 1,   (15)

and the inner loop for the k-th outer loop contains q_n iterations of the following update

ŵ_{t+1,k} = ŵ_{t,k} − η_n (∇f(ŵ_{t,k}, z_{i_t}) + cV_n ŵ_{t,k} − ∇f(w̃_k, z_{i_t}) − cV_n w̃_k + ∇R_n(w̃_k)),   (16)

for t = 0, ..., q_n − 1, where the iterates for the inner loop at step k are initialized as ŵ_{0,k} = w̃_k, and i_t is the index of the function chosen uniformly at random from the set {1, ..., n} at the inner iterate t. The outcome of each inner loop, ŵ_{q_n,k}, is used as the variable for the next outer loop, i.e., w̃_{k+1} = ŵ_{q_n,k}. We define the outcome of s_n outer loops, w̃_{s_n}, as the approximate solution for the risk R_n, i.e., w_n = w̃_{s_n}.

In the following theorem we derive a bound on the number of required outer loops s_n to ensure that the variable w_n generated by the updates in (15) and (16) is within the statistical accuracy of R_n in expectation, i.e., E[R_n(w_n) − R_n(w*_n)] ≤ V_n. To reach the smallest possible lower bound for s_n, we properly choose the number of inner loop iterations q_n and the learning rate η_n.

Theorem 4. Consider the variable w_m as a V_m-optimal solution of the risk R_m, i.e., a solution such that E[R_m(w_m) − R_m(w*_m)] ≤ V_m, where V_m = O(1/m^α). Consider the sets S_m ⊂ S_n ⊂ T such that n = 2m, and suppose Assumption 1 holds. Further, set the number of inner loop iterations as q_n = n and the learning rate as η_n = 0.1/(M + cV_n). 
Then, the variable w_n generated based on the updates of Ada SVRG in (15)-(16) satisfies E[R_n(w_n) − R_n(w*_n)] ≤ V_n if the number of outer loop iterations s_n is larger than

s_n ≥ log_2[3 × 2^α + (2^α − 1)(2 + (c/2)‖w*‖²)].   (17)

Moreover, to reach the statistical accuracy V_N of the full training set T the overall computational complexity of Ada SVRG is given by

4N log_2[3 × 2^α + (2^α − 1)(2 + (c/2)‖w*‖²)].   (18)

Proof. See Section 7.4.

The result in (17) shows that the minimum number of outer loop iterations for Ada SVRG is equal to s_n = ⌊log_2[3 × 2^α + (2^α − 1)(2 + (c/2)‖w*‖²)]⌋ + 1. This bound leads to the result in (18), which shows that the overall computational complexity of Ada SVRG to reach the statistical accuracy of the full training set T is of the order O(N). This bound not only improves the O(N^(1+α/2)) bound for Ada AGD, but also enhances the complexity of SVRG for reaching the same target accuracy, which is given by O((N + κ) log(N^α)) = O(N log(N^α)).

5 Experiments

In this section, we compare the adaptive sample size versions of a group of first-order methods, including gradient descent (GD), accelerated gradient descent (AGD), and stochastic variance reduced gradient (SVRG), with their standard (fixed sample size) versions. In the main paper, we only use the RCV1 dataset. Further numerical experiments on the MNIST dataset can be found in Section 7.5 in the supplementary material. We use N = 10,000 samples of the RCV1 dataset for the training set and the remaining 10,242 as the test set. The number of features in each sample is p = 47,236. In our experiments, we use the logistic loss. The constant c should be within the order of the gradients' Lipschitz continuity constant M, and, therefore, we set it as c = 1 since the samples are normalized and M = 1. The size of the initial training set for adaptive methods is m_0 = 400. 
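The geometric sample size schedule used in these experiments -- start at m_0 and double until the full training set is reached -- can be written as a small helper (an illustrative sketch, not the experiment code):

```python
def sample_size_schedule(m0, N):
    """Sample sizes visited by the adaptive scheme: m0, 2*m0, ..., capped at N."""
    sizes, n = [m0], m0
    while n < N:
        n = min(2 * n, N)   # double, but never exceed the full set
        sizes.append(n)
    return sizes
```

With m_0 = 400 and N = 10,000 this gives the six stages 400, 800, 1600, 3200, 6400, 10,000, matching the 1 + log_2(N/m_0) stage count that appears in the complexity bound (14).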
In our experiments we assume α = 0.5 and therefore the added regularization term is (1/√n)‖w‖².

The plots in Figure 1 compare the suboptimality of GD, AGD, and SVRG with their adaptive sample size versions. As our theoretical results suggested, we observe that the adaptive sample size scheme reduces the overall computational complexity of all of the considered linearly convergent first-order methods.

Figure 1: Suboptimality vs. number of effective passes for the RCV1 dataset with regularization of O(1/√n).

Figure 2: Test error vs. number of effective passes for the RCV1 dataset with regularization of O(1/√n).

If we compare the test errors of GD, AGD, and SVRG with their adaptive sample size variants, we reach the same conclusion: the adaptive sample size scheme reduces the overall computational cost of reaching the statistical accuracy of the full training set. In particular, the left plot in Figure 2 shows that Ada GD approaches the minimum test error of 8% after 55 effective passes, while GD cannot reach this test error even after 100 passes; GD would attain a lower test error only if run for many more iterations. The central plot in Figure 2 showcases that Ada AGD reaches the 8% test error about 5 times faster than AGD, as predicted by $\log(N^{\alpha}) = \log(100) \approx 4.6$. The right plot in Figure 2 illustrates a similar improvement for Ada SVRG. We have observed similar performance on other datasets such as MNIST – see Section 7.5 in the supplementary material.

6 Discussions

We presented an adaptive sample size scheme to improve the convergence guarantees of a class of first-order methods that have linear convergence rates under strong convexity and smoothness assumptions. The logic behind the proposed adaptive sample size scheme is to replace the solution of a relatively hard problem – the ERM problem for the full training set – by a sequence of relatively easier problems – ERM problems corresponding to subsets of the samples.
Indeed, whenever m < n, solving the ERM problem in (4) for the loss $R_m$ is simpler than the one for the loss $R_n$ because:

(i) The adaptive regularization term of order $V_m$ makes the condition number of $R_m$ smaller than the condition number of $R_n$, which uses a regularizer of order $V_n$.

(ii) The approximate solution $w_m$ that we need to find for $R_m$ is less accurate than the approximate solution $w_n$ that we need to find for $R_n$.

(iii) The computational cost of an iteration for $R_m$ – e.g., the cost of evaluating a gradient – is lower than the cost of an iteration for $R_n$.

Properties (i)-(iii), combined with the ability to grow the sample size geometrically, reduce the overall computational complexity of reaching the statistical accuracy of the full training set. We particularized our results to develop adaptive (Ada) versions of AGD and SVRG. In both methods we found a computational complexity reduction of order $O(\log(1/V_N)) = O(\log(N^{\alpha}))$, which was corroborated in numerical experiments. The idea and analysis of adaptive first-order methods apply generically to any other approach with a linear convergence rate (Theorem 2). The development of sample size adaptation for sublinear methods is left for future research.

Acknowledgments

This research was supported by NSF CCF 1717120 and ARO W911NF1710438.