{"title": "Without-Replacement Sampling for Stochastic Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 46, "page_last": 54, "abstract": "Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled *with* replacement. In contrast, sampling *without* replacement is far less understood, yet in practice it is very common, often easier to implement, and usually performs better. In this paper, we provide competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data. Moreover, we describe a useful application of these results in the context of distributed optimization with randomly-partitioned data, yielding a nearly-optimal algorithm for regularized least squares (in terms of both communication complexity and runtime complexity) under broad parameter regimes. Our proof techniques combine ideas from stochastic optimization, adversarial online learning and transductive learning theory, and can potentially be applied to other stochastic optimization and learning problems.", "full_text": "Without-Replacement Sampling\nfor Stochastic Gradient Methods\n\nDepartment of Computer Science and Applied Mathematics\n\nOhad Shamir\n\nWeizmann Institute of Science\n\nRehovot, Israel\n\nohad.shamir@weizmann.ac.il\n\nAbstract\n\nStochastic gradient methods for machine learning and optimization problems are\nusually analyzed assuming data points are sampled with replacement. In con-\ntrast, sampling without replacement is far less understood, yet in practice it is\nvery common, often easier to implement, and usually performs better. In this\npaper, we provide competitive convergence guarantees for without-replacement\nsampling under several scenarios, focusing on the natural regime of few passes\nover the data. 
Moreover, we describe a useful application of these results in the context of distributed optimization with randomly-partitioned data, yielding a nearly-optimal algorithm for regularized least squares (in terms of both communication complexity and runtime complexity) under broad parameter regimes. Our proof techniques combine ideas from stochastic optimization, adversarial online learning and transductive learning theory, and can potentially be applied to other stochastic optimization and learning problems.\n\n1 Introduction\n\nMany canonical machine learning problems boil down to solving a convex empirical risk minimization problem of the form\n\nmin_{w∈W} F(w) = (1/m) Σ_{i=1}^m f_i(w),    (1)\n\nwhere each individual function f_i(·) is convex (e.g. the loss on a given example in the training data), and the set W ⊆ R^d is convex. In large-scale applications, where both m and d can be huge, a very popular approach is to employ stochastic gradient methods. Generally speaking, these methods maintain some iterate w_t ∈ W, and at each iteration sample an individual function f_i(·) and perform some update to w_t based on ∇f_i(w_t). Since the update is with respect to a single function, it is usually computationally cheap. Moreover, when the sampling is done independently and uniformly at random, ∇f_i(w_t) is an unbiased estimator of the true gradient ∇F(w_t), which allows for good convergence guarantees after a reasonable number of iterations (see for instance [18, 15]). However, in practical implementations of such algorithms, it is actually quite common to use without-replacement sampling, or equivalently, to pass sequentially over a random shuffling of the functions f_i. Intuitively, this forces the algorithm to process all data points more equally, and often leads to better empirical performance.
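As a point of reference, the two sampling schemes for plain stochastic gradient descent can be sketched as follows. This is a minimal illustration on a toy least-squares instance; the function and variable names are ours, not the paper's.

```python
import numpy as np

def make_problem(m=200, d=5, seed=0):
    """Toy least-squares instance: f_i(w) = 0.5 * (<w, x_i> - y_i)^2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, d)) / np.sqrt(d)
    y = X @ rng.standard_normal(d)
    return X, y

def sgd(X, y, steps, lr, sampler):
    """Plain SGD on F(w) = (1/m) sum_i f_i(w); `sampler` yields component indices."""
    m, d = X.shape
    w = np.zeros(d)
    for _, i in zip(range(steps), sampler):
        grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of the single component f_i
        w -= lr * grad_i
    return w

def with_replacement(m, rng):
    while True:                              # i.i.d. uniform indices
        yield int(rng.integers(m))

def without_replacement(m, rng):
    yield from rng.permutation(m)            # one pass over a random shuffling

def full_loss(X, y, w):
    return 0.5 * np.mean((X @ w - y) ** 2)
```

Both samplers drive the objective down on such instances; the question studied in the paper is whether the second enjoys guarantees comparable to the first.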
Moreover, without-replacement sampling is often easier and faster to\nimplement, as it requires sequential data access, as opposed to the random data access required by\nwith-replacement sampling (see for instance [3, 16, 8]).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f1.1 What is Known so Far?\n\nUnfortunately, without-replacement sampling is not covered well by current theory. The challenge is\nthat unlike with-replacement sampling, the functions processed at every iteration are not statistically\nindependent, and their correlations are dif\ufb01cult to analyze. Since this lack of theory is the main\nmotivation for our paper, we describe the existing known results in some detail, before moving to our\ncontributions.\nTo begin with, there exist classic convergence results which hold deterministically for every order in\nwhich the individual functions are processed, and in particular when we process them by sampling\nwithout replacement (e.g. [14]). However, these can be exponentially worse than those obtained\nusing random without-replacement sampling, and this gap is inevitable (see for instance [16]).\nMore recently, Recht and R\u00e9 [16] studied this problem, attempting to show that at least for least\nsquares optimization, without-replacement sampling is always better (or at least not substantially\nworse) than with-replacement sampling on a given dataset. They showed this reduces to a fundamental\nconjecture about arithmetic-mean inequalities for matrices, and provided partial results in that\ndirection, such as when the individual functions themselves are assumed to be generated i.i.d. from\nsome distribution. However, the general question remains open.\nIn a recent breakthrough, G\u00fcrb\u00fczbalaban et al. [8] provided a new analysis of gradient descent\nalgorithms for solving Eq. (1) based on random reshuf\ufb02ing: Each epoch, the algorithm draws a new\npermutation on {1, . . . 
, m} uniformly at random, and processes the individual functions in that order. Under smoothness and strong convexity assumptions, the authors obtain convergence guarantees of essentially O(1/k²) after k epochs, vs. O(1/k) using with-replacement sampling (with the O(·) notation including certain dependencies on the problem parameters and data size). Thus, without-replacement sampling is shown to be strictly better than with-replacement sampling after sufficiently many passes over the data. However, this leaves open the question of why without-replacement sampling works well after a few (or even just one) passes over the data. Indeed, this is often the regime in which stochastic gradient methods are most useful, do not require repeated data reshuffling, and their good convergence properties are well-understood in the with-replacement case.\n\n1.2 Our Results\n\nIn this paper, we provide convergence guarantees for stochastic gradient methods under several scenarios, in the natural regime where the number of passes over the data is small, and in particular where no data reshuffling is necessary. We emphasize that our goal here is more modest than those of [16, 8]: rather than show the superiority of without-replacement sampling, we only show that it will not be significantly worse (in a worst-case sense) than with-replacement sampling. Nevertheless, such guarantees are novel, and still justify the use of without-replacement sampling, especially in situations where it is advantageous due to data access constraints or other reasons. Moreover, these results have a useful application in the context of distributed learning and optimization, as we will shortly describe.
Our main contributions can be summarized as follows:\n\n• For convex functions on some convex domain W, we consider algorithms which perform a single pass over a random permutation of m individual functions, and show that their suboptimality can be characterized by a combination of two quantities, each from a different field: first, the regret which the algorithm can attain in the setting of adversarial online convex optimization [17, 9], and second, the transductive Rademacher complexity of W with respect to the individual functions, a notion stemming from transductive learning theory [22, 6].\n\n• As a concrete application of the above, we show that if each function f_i(·) corresponds to a convex Lipschitz loss of a linear predictor, and the algorithm belongs to the class of algorithms which in the online setting attain O(√T) regret on T such functions (which includes, for example, stochastic gradient descent), then the suboptimality using without-replacement sampling, after processing T functions, is O(1/√T). Up to numerical constants, the guarantee is the same as that obtained using with-replacement sampling.\n\n• We turn to consider more specifically the stochastic gradient descent algorithm, and show that if the objective function F(·) is λ-strongly convex, and the functions f_i(·) are also smooth, then the suboptimality bound becomes O(1/λT), which matches the with-replacement guarantees (although with replacement, smoothness is not needed, and the dependence on some parameters hidden in the O(·) is somewhat better).\n\n• In recent years, a new set of fast stochastic algorithms to solve Eq. (1) has emerged, such as SAG, SDCA, SVRG, and quite a few other variants.
These algorithms are characterized by cheap stochastic iterations, involving computations of individual function gradients, yet unlike traditional stochastic algorithms, enjoy a linear convergence rate (runtime scaling logarithmically with the required accuracy). To the best of our knowledge, all existing analyses require sampling with replacement. We consider a representative algorithm from this set, namely SVRG, and the problem of regularized least squares, and show that similar guarantees can be obtained using without-replacement sampling. This result has a potentially interesting implication: under the mild assumption that the problem's condition number is smaller than the data size, we get that SVRG can converge to an arbitrarily accurate solution (even up to machine precision) without the need to reshuffle the data: only a single shuffle at the beginning suffices. Thus, at least for this problem, we can obtain fast and high-quality solutions even if random data access is expensive.\n\n• A further application of the SVRG result is in the context of distributed learning: by simulating without-replacement SVRG on data randomly partitioned between several machines, we get a nearly-optimal algorithm for regularized least squares, in terms of communication and computational complexity, as long as the condition number is smaller than the data size per machine (up to logarithmic factors). This builds on the work of Lee et al. [13], who were the first to recognize the applicability of SVRG to distributed optimization. However, their results relied on with-replacement sampling, and are applicable only for much smaller condition numbers.\n\nWe note that our focus is on scenarios where no reshuffling is necessary. In particular, the O(1/√T) and O(1/λT) bounds apply for all T ∈ {1, 2, . . . , m}, namely up to one full pass over a random permutation of the entire data.
However, our techniques are also applicable to a constant (> 1) number of passes, by randomly reshuffling the data after every pass. In a similar vein, our SVRG result can be readily extended to a situation where each epoch of the algorithm is done on an independent permutation of the data. We leave a full treatment of this to future work.\n\n2 Preliminaries and Notation\n\nWe use bold-face symbols to denote vectors. Given a vector w, w_i denotes its i-th coordinate. We utilize the standard O(·), Θ(·), Ω(·) notation to hide constants, and Õ(·), Θ̃(·), Ω̃(·) to hide constants and logarithmic factors.\nGiven convex functions f_1(·), f_2(·), . . . , f_m(·) from R^d to R, we define our objective function F : R^d → R as\n\nF(w) = (1/m) Σ_{i=1}^m f_i(w),\n\nwith some fixed optimum w* ∈ arg min_{w∈W} F(w). In machine learning applications, each individual f_i(·) usually corresponds to a loss with respect to a data point, hence we will use the terms "individual function", "loss function" and "data point" interchangeably throughout the paper.\nWe let σ be a permutation over {1, . . . , m} chosen uniformly at random. In much of the paper, we consider methods which draw loss functions without replacement according to that permutation (that is, f_σ(1)(·), f_σ(2)(·), f_σ(3)(·), . . .).
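For later reference, the running averages over the already-processed and the not-yet-processed losses under such a permutation can be computed as follows (a small numerical helper with our own naming; the analysis tracks exactly these two averages):

```python
import numpy as np

def prefix_suffix_averages(losses, t):
    """Given the loss values (f_{sigma(1)}(w), ..., f_{sigma(m)}(w)) at a fixed w,
    return the average over the first t-1 values and over the last m-t+1 values.
    By convention the prefix average is 0 when t = 1."""
    prefix = losses[:t - 1].mean() if t > 1 else 0.0
    suffix = losses[t - 1:].mean()
    return prefix, suffix
```

A simple sanity check is that the two averages, weighted by their group sizes, always recombine into the overall average loss.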
We will use the shorthand notation\n\nF_{1:t−1}(·) = (1/(t−1)) Σ_{i=1}^{t−1} f_σ(i)(·) ,  F_{t:m}(·) = (1/(m−t+1)) Σ_{i=t}^{m} f_σ(i)(·)\n\nto denote the average loss with respect to the first t−1 and last m−t+1 loss functions respectively, as ordered by the permutation (intuitively, the losses in F_{1:t−1}(·) are those already observed by the algorithm at the beginning of iteration t, whereas the losses in F_{t:m}(·) are those not yet observed). We use the convention that F_{1:1}(·) ≡ 0, and the same goes for other expressions of the form (1/(t−1)) Σ_{i=1}^{t−1} ··· throughout the paper, when t = 1. We also define quantities such as ∇F_{1:t−1}(·) and ∇F_{t:m}(·) similarly.\nA function f : R^d → R is λ-strongly convex if for any w, w′, f(w′) ≥ f(w) + ⟨g, w′ − w⟩ + (λ/2)‖w′ − w‖², where g is any (sub)gradient of f at w, and is µ-smooth if for any w, w′, f(w′) ≤ f(w) + ⟨g, w′ − w⟩ + (µ/2)‖w′ − w‖². µ-smoothness also implies that the function f is differentiable, and its gradient is µ-Lipschitz. Based on these properties, it is easy to verify that if w* ∈ arg min f(w), and f is λ-strongly convex and µ-smooth, then (λ/2)‖w − w*‖² ≤ f(w) − f(w*) ≤ (µ/2)‖w − w*‖².\nWe will also require the notion of transductive Rademacher complexity, as developed by El-Yaniv and Pechyony [6, Definition 1], with a slightly different notation adapted to our setting:\n\nDefinition 1. Let V be a set of vectors v = (v_1, . . . , v_m) in R^m. Let s, u be positive integers such that s + u = m, and denote p := su/(s+u)² ∈ (0, 1/2). We define the transductive Rademacher complexity R_{s,u}(V) as\n\nR_{s,u}(V) = (1/s + 1/u) · E_{r_1,...,r_m}[sup_{v∈V} Σ_{i=1}^m r_i v_i],\n\nwhere r_1, . . . , r_m are i.i.d. random variables such that r_i = 1 or −1 with probability p, and r_i = 0 with probability 1 − 2p.\n\nThis quantity is an important parameter in studying the richness of the set V, and will prove crucial in providing some of the convergence results presented later on. Note that it differs slightly from standard Rademacher complexity, which is used in the theory of learning from i.i.d. data, where the Rademacher variables r_i only take −1, +1 values, and (1/s + 1/u) is replaced by 1/m.\n\n3 Convex Lipschitz Functions\n\nWe begin by considering loss functions f_1(·), f_2(·), . . . , f_m(·) which are convex and L-Lipschitz over some convex domain W. We assume the algorithm sequentially goes over some permuted ordering of the losses, and before processing the t-th loss function, produces an iterate w_t ∈ W. Moreover, we assume that the algorithm has a regret bound in the adversarial online setting, namely that for any sequence of T convex Lipschitz losses f_1(·), . . . , f_T(·), and any w ∈ W,\n\nΣ_{t=1}^T f_t(w_t) − Σ_{t=1}^T f_t(w) ≤ R_T\n\nfor some R_T scaling sub-linearly in T.¹ For example, online gradient descent (which is equivalent to stochastic gradient descent when the losses are i.i.d.), with a suitable step size, satisfies R_T = O(BL√T), where L is the Lipschitz parameter and B upper bounds the norm of any vector in W. A similar regret bound can also be shown for other online algorithms (see [9, 17, 23]).\nSince the ideas used for analyzing this setting will also be used in the more complicated results which follow, we provide the analysis in some detail.
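For concreteness, here is a minimal projected online gradient descent routine of the kind referred to above, specialized to linear losses f_t(w) = ⟨g_t, w⟩; the step size η_t = B/(L√t) is one standard choice yielding O(BL√T) regret. This is our own illustration, not code from the paper.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w|| <= B}."""
    n = np.linalg.norm(w)
    return w if n <= B else (B / n) * w

def online_gradient_descent(grads, B, L):
    """Online gradient descent for linear losses f_t(w) = <g_t, w>.
    grads: list of loss gradients g_1, ..., g_T with ||g_t|| <= L.
    Returns the iterates w_1, ..., w_T, all kept in the ball of radius B."""
    w = np.zeros(grads[0].shape[0])
    iterates = []
    for t, g in enumerate(grads, start=1):
        iterates.append(w.copy())
        eta = B / (L * np.sqrt(t))          # standard O(BL*sqrt(T))-regret step size
        w = project_ball(w - eta * g, B)
    return iterates
```

With this step size the classical analysis gives regret at most (3/2)·BL√T against any fixed comparator in the ball.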
We first have the following straightforward theorem, which upper bounds the expected suboptimality in terms of regret and the expected difference between the average loss on prefixes and suffixes of the data.\n\nTheorem 1. Suppose the algorithm has a regret bound R_T, and sequentially processes f_σ(1)(·), . . . , f_σ(T)(·), where σ is a random permutation on {1, . . . , m}. Then in expectation over σ,\n\nE[(1/T) Σ_{t=1}^T F(w_t) − F(w*)] ≤ R_T/T + (1/(mT)) Σ_{t=2}^T (t−1)·E[F_{1:t−1}(w_t) − F_{t:m}(w_t)].\n\nThe left hand side in the inequality above can be interpreted as an expected bound on F(w_t) − F(w*), where t is drawn uniformly at random from 1, 2, . . . , T. Alternatively, by Jensen's inequality and the fact that F(·) is convex, the same bound also applies to E[F(w̄_T) − F(w*)], where w̄_T = (1/T) Σ_{t=1}^T w_t.\nThe proof of the theorem relies on the following simple but key lemma, which expresses the expected difference between with-replacement and without-replacement sampling in an alternative form, similar to Thm. 1, and one which lends itself to tools and ideas from transductive learning theory. This lemma will be used in proving all our main results, and its proof appears in Subsection A.2.\n\n¹For simplicity, we assume the algorithm is deterministic given f_1, . . . , f_m, but all results in this section also hold for randomized algorithms (in expectation over the algorithm's randomness), assuming the expected regret of the algorithm w.r.t.
any w ∈ W is at most R_T.\n\nLemma 1. Let σ be a permutation over {1, . . . , m} chosen uniformly at random. Let s_1, . . . , s_m ∈ R be random variables which, conditioned on σ(1), . . . , σ(t−1), are independent of σ(t), . . . , σ(m). Then E[(1/m) Σ_{i=1}^m s_i − s_σ(t)] equals ((t−1)/m)·E[s_{1:t−1} − s_{t:m}] for t > 1, and 0 for t = 1.\n\nProof of Thm. 1. Adding and subtracting terms, and using the facts that σ is a permutation chosen uniformly at random, and w* is fixed,\n\nE[(1/T) Σ_{t=1}^T (F(w_t) − F(w*))] = E[(1/T) Σ_{t=1}^T (f_σ(t)(w_t) − f_σ(t)(w*))] + E[(1/T) Σ_{t=1}^T (F(w_t) − f_σ(t)(w_t))] + E[(1/T) Σ_{t=1}^T (f_σ(t)(w*) − F(w*))],\n\nwhere the last term vanishes, since E[f_σ(t)(w*)] = F(w*) for each t. Applying the regret bound assumption on the sequence of losses f_σ(1)(·), . . . , f_σ(T)(·), the above is at most R_T/T + (1/T) Σ_{t=1}^T E[F(w_t) − f_σ(t)(w_t)]. Since w_t (as a random variable over the permutation σ of the data) depends only on σ(1), . . . , σ(t−1), we can use Lemma 1 (where s_i = f_i(w_t), and noting that the expectation above is 0 when t = 1), and get that the above equals R_T/T + (1/(mT)) Σ_{t=2}^T (t−1)·E[F_{1:t−1}(w_t) − F_{t:m}(w_t)].\n\nHaving reduced the expected suboptimality to the expected difference E[F_{1:t−1}(w_t) − F_{t:m}(w_t)], the next step is to upper bound it by E[sup_{w∈W}(F_{1:t−1}(w) − F_{t:m}(w))]: namely, having split our loss functions at random into two groups of size t−1 and m−t+1, how large can the difference be between the average loss of any w on the two groups? Such uniform convergence bounds are exactly the type studied in transductive learning theory, where a fixed dataset is randomly split into a training set and a test set, and we consider the generalization performance of learning algorithms run on the training set. Such results can be provided in terms of the transductive Rademacher complexity of W, and combined with Thm. 1, lead to the following bound in our setting:\n\nTheorem 2. Suppose that each w_t is chosen from a fixed domain W, that the algorithm enjoys a regret bound R_T, and that sup_{i,w∈W} |f_i(w)| ≤ B. Then in expectation over the random permutation σ,\n\nE[(1/T) Σ_{t=1}^T F(w_t) − F(w*)] ≤ R_T/T + (1/(mT)) Σ_{t=2}^T (t−1)·R_{t−1:m−t+1}(V) + 24B/√m,\n\nwhere V = {(f_1(w), . . . , f_m(w)) | w ∈ W}.\n\nThus, we obtained a generic bound which depends on the online learning characteristics of the algorithm, as well as the statistical learning properties of the class W on the loss functions. The proof
The proof\n(as the proofs of all our results from now on) appears in Section A.\nWe now instantiate the theorem to the prototypical case of bounded-norm linear predictors, where the\nloss is some convex and Lipschitz function of the prediction (cid:104)w, x(cid:105) of a predictor w on an instance\nx, possibly with some regularization:\nCorollary 1. Under the conditions of Thm. 2, suppose that W \u2286 {w : (cid:107)w(cid:107) \u2264 \u00afB}, and each loss\nfunction fi has the form (cid:96)i((cid:104)w, xi(cid:105)) + r(w) for some L-Lipschitz (cid:96)i, (cid:107)xi(cid:107) \u2264 1, and a \ufb01xed function\n\nr. Then E(cid:104) 1\n\nT\n\n(cid:80)T\nt=1 F (wt) \u2212 F (w\u2217)\n\n(cid:105) \u2264 RT\n\n\u221a\n2(12+\n\u221a\n\n2) \u00afBL\nm\n\n.\n\nT +\n\nAs discussed earlier, in the setting of Corollary 1, typical regret bounds are on the order of O( \u00afBL\nT ).\nThus, the expected suboptimality is O( \u00afBL/\nT ), all the way up to T = m (i.e. a full pass over a\nrandom permutation of the data). Up to constants, this is the same as the suboptimality attained by T\niterations of with-replacement sampling, using stochastic gradient descent or similar algorithms.\n\n\u221a\n\n\u221a\n\n4 Faster Convergence for Strongly Convex Functions\nWe now consider more speci\ufb01cally the stochastic gradient descent algorithm on a convex domain W,\nwhich can be described as follows: We initialize at some w1 \u2208 W, and perform the update steps\n\nwt+1 = \u03a0W (wt \u2212 \u03b7tgt),\n\n5\n\n\fwhere \u03b7t > 0 are \ufb01xed step sizes, \u03a0W is projection on W, and gt is a subgradient of f\u03c3(t)(\u00b7) at\nwt. Moreover, we assume the objective function F (\u00b7) is \u03bb-strongly convex for some \u03bb > 0. In this\nscenario, using with-replacement sampling (i.e. 
g_t is a subgradient of an independently drawn f_i(·)), performing T iterations as above and returning a randomly sampled iterate w_t or their average results in expected suboptimality Õ(G²/λT), where G² is an upper bound on the expected squared norm of g_t [15, 18]. Here, we study a similar situation in the without-replacement case.\nIn the result below, we will consider specifically the case where each f_i(w) is a Lipschitz and smooth loss of a linear predictor w, possibly with some regularization. The smoothness assumption is needed to get a good bound on the transductive Rademacher complexity of quantities such as ⟨∇f_i(w), w⟩. However, the technique can potentially be applied to more general cases.\n\nTheorem 3. Suppose W has diameter B, and that F(·) is λ-strongly convex on W. Assume that f_i(w) = ℓ_i(⟨w, x_i⟩) + r(w) where ‖x_i‖ ≤ 1, r(·) is possibly some regularization term, and each ℓ_i is L-Lipschitz and µ-smooth on {z : z = ⟨w, x⟩, w ∈ W, ‖x‖ ≤ 1}. Furthermore, suppose sup_{w∈W} ‖∇f_i(w)‖ ≤ G. Then for any 1 < T ≤ m, if we run SGD for T iterations with step size η_t = 2/λt, we have (for a universal positive constant c)\n\nE[(1/T) Σ_{t=1}^T F(w_t) − F(w*)] ≤ c·((L + µB)² + G²)·log(T) / (λT).\n\nAs in the results of the previous section, the left hand side is the expected suboptimality of a single w_t where t is chosen uniformly at random, or an upper bound on the expected suboptimality of the average w̄_T = (1/T) Σ_{t=1}^T w_t. This result is similar to the with-replacement case, up to numerical constants and the additional (L + µB)² term in the numerator. We note that in some cases, G² is the dominant term anyway.²
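The step-size schedule in Theorem 3 is easy to state in code. The following sketch runs one without-replacement pass of projected SGD with η_t = 2/(λt) on a toy regularized least-squares instance; the setup and names are our own illustration, not the paper's experiment.

```python
import numpy as np

def sgd_strongly_convex(X, y, lam, B, seed=0):
    """One without-replacement pass of projected SGD with step size 2/(lam*t)
    on f_i(w) = 0.5*(<w,x_i> - y_i)^2 + (lam/2)*||w||^2; returns the averaged iterate."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for t, i in enumerate(rng.permutation(m), start=1):
        g = (X[i] @ w - y[i]) * X[i] + lam * w   # subgradient of f_{sigma(t)} at w_t
        w = w - (2.0 / (lam * t)) * g
        n = np.linalg.norm(w)
        if n > B:                                 # projection onto {||w|| <= B}
            w *= B / n
        avg += (w - avg) / t                      # running average of the iterates
    return avg

def ridge_objective(X, y, lam, w):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
```

On a strongly convex instance the averaged iterate approaches the exact ridge solution at the 1/(λT)-type rate the theorem describes.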
However, it is not clear that the (L + µB)² term is necessary, and removing it is left to future work. We also note that the log(T) factor in the theorem can be removed by considering not (1/T) Σ_{t=1}^T F(w_t), but rather only an average over some suffix of the iterates, or weighted averaging (see for instance [15, 12, 21]; the same techniques can be applied here).\nThe proof of Thm. 3 is somewhat more challenging than the results of the previous section, since we are attempting to get a faster rate of O(1/T) rather than O(1/√T), all the way up to T = m. A significant technical obstacle is that our proof technique relies on concentration of averages around expectations, which on T samples does not go down faster than O(1/√T). To overcome this, we apply concentration results not on the function values (i.e. F_{1:t−1}(w_t) − F_{t:m}(w_t) as in the previous section), but rather on gradient-iterate inner products, i.e. ⟨∇F_{1:t−1}(w_t) − ∇F_{t:m}(w_t), w_t − w*⟩, where w* is the optimal solution. To get good bounds, we need to assume these gradients have a certain structure, which is why we make the assumption in the theorem about the form of each f_i(·). Using transductive Rademacher complexity tools, we manage to upper bound the expectation of these inner products by quantities roughly of the form √(E[‖w_t − w*‖²])/√t (assuming here t < m/2 for simplicity). We then utilize the fact that in the strongly convex case, ‖w_t − w*‖ itself decreases to zero with t at a certain rate, to get fast rates decreasing as 1/t.\n\n5 Without-Replacement SVRG for Least Squares\n\nIn this section, we consider a more sophisticated stochastic gradient approach, namely the SVRG algorithm of [11], designed to solve optimization problems with a finite-sum structure as in Eq.
(1). Unlike purely stochastic gradient procedures, this algorithm does require several passes over the data. However, assuming the condition number 1/λ is smaller than the data size (assuming each f_i(·) is O(1)-smooth, and λ is the strong convexity parameter of F(·)), only O(m log(1/ε)) gradient evaluations are required to get an ε-accurate solution, for any ε. Thus, we can get a high-accuracy solution after the equivalent of a small number of data passes. As discussed in the introduction, over the past few years several other algorithms have been introduced and shown to have such behavior. We will focus on the algorithm in its basic form, where the domain W equals R^d.\nThe existing analysis of SVRG ([11]) assumes stochastic iterations, which involve sampling the data with replacement. Thus, it is natural to consider whether a similar convergence behavior occurs using without-replacement sampling. As we shall see, a positive reply has at least two implications. The first is that as long as the condition number is smaller than the data size, SVRG can be used to obtain a high-accuracy solution, without the need to reshuffle the data: only a single shuffle at the beginning suffices, and the algorithm terminates after a small number of sequential passes (logarithmic in the required accuracy). The second implication is that such without-replacement SVRG can be used to get a nearly-optimal algorithm for convex distributed learning and optimization on randomly partitioned data, as long as the condition number is smaller than the data size at each machine.\nThe SVRG algorithm using without-replacement sampling on a dataset of size m is described as Algorithm 1.\n\n²G is generally on the order of L + λB, which is the same as L + µB if L is the dominant term. This happens for instance with the squared loss, whose Lipschitz parameter is on the order of µB.
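A minimal Python sketch of this scheme (Algorithm 1 below), specialized to the regularized least-squares components considered in this section: the inner loop consumes successive blocks of a single fixed permutation, so no reshuffling between epochs is needed. The parameter choices and names here are our own illustration, not the values prescribed by the analysis.

```python
import numpy as np

def svrg_without_replacement(X, y, lam, eta, T, S, seed=0):
    """SVRG whose stochastic iterations read one random permutation sequentially
    (no reshuffling between epochs); components are
    f_i(w) = 0.5*(<w,x_i> - y_i)^2 + (lam/2)*||w||^2, so S*T <= m is required."""
    m, d = X.shape
    assert S * T <= m, "a single permutation must cover all stochastic iterations"
    sigma = np.random.default_rng(seed).permutation(m)
    grad = lambda i, w: (X[i] @ w - y[i]) * X[i] + lam * w
    w_tilde = np.zeros(d)
    for s in range(S):
        # one full-gradient computation per epoch
        n_tilde = (X.T @ (X @ w_tilde - y)) / m + lam * w_tilde
        w = w_tilde.copy()
        total = np.zeros(d)
        for t in range(s * T, (s + 1) * T):
            i = sigma[t]                       # next index of the single permutation
            w = w - eta * (grad(i, w) - grad(i, w_tilde) + n_tilde)
            total += w
        w_tilde = total / T                    # average of the epoch's iterates
    return w_tilde

def ridge_loss(X, y, lam, w):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
```

On a well-conditioned instance a handful of epochs already shrinks the suboptimality gap substantially, mirroring the linear convergence discussed above.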
It is composed of multiple epochs (indexed by s), each involving one gradient computation on the entire dataset, and T stochastic iterations, involving gradient computations with respect to individual data points. Although the gradient computation on the entire dataset is expensive, it only needs to be done once per epoch. Since the algorithm will be shown to have linear convergence as a function of the number of epochs, this requires only a small (logarithmic) number of passes over the data.\n\nAlgorithm 1 SVRG using Without-Replacement Sampling\n\nParameters: η, T, permutation σ on {1, . . . , m}\nInitialize w̃_1 at 0\nfor s = 1, 2, . . . do\n  w_{(s−1)T+1} := w̃_s\n  ñ := ∇F(w̃_s) = (1/m) Σ_{i=1}^m ∇f_i(w̃_s)\n  for t = (s−1)T + 1, . . . , sT do\n    w_{t+1} := w_t − η(∇f_σ(t)(w_t) − ∇f_σ(t)(w̃_s) + ñ)\n  end for\n  Let w̃_{s+1} be the average of w_{(s−1)T+1}, . . . , w_{sT}, or one of them drawn uniformly at random.\nend for\n\nWe will consider specifically the regularized least mean squares problem, where\n\nf_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)‖w‖²    (2)\n\nfor some x_i, y_i and λ̂ > 0. Moreover, we assume that F(w) = (1/m) Σ_{i=1}^m f_i(w) is λ-strongly convex (note that necessarily λ ≥ λ̂). For convenience, we will assume that ‖x_i‖, |y_i|, λ are all at most 1 (this is without much loss of generality, since we can always re-scale the loss functions by an appropriate factor). Note that under this assumption, each f_i(·) as well as F(·) is also (1 + λ̂) ≤ 2-smooth.\n\nTheorem 4. Suppose each loss function f_i(·) corresponds to Eq.
(2), where x_i ∈ R^d, max_i ‖x_i‖ ≤ 1, max_i |y_i| ≤ 1, λ̂ > 0, and that F(·) is λ-strongly convex, where λ ∈ (0, 1). Moreover, let B ≥ 1 be such that ‖w*‖² ≤ B and max_t F(w_t) − F(w*) ≤ B with probability 1 over the random permutation. There is some universal constant c_0 ≥ 1, such that for any c ≥ c_0 and any ε ∈ (0, 1), if we run Algorithm 1 using parameters η, T satisfying\n\nη = 1/c ,  T ≥ 9/(ηλ) ,  m ≥ c log²(64dmB²/(λε))·T,\n\nthen after S = ⌈log₄(9/ε)⌉ epochs of the algorithm, w̃_{S+1} satisfies E[F(w̃_{S+1}) − min_w F(w)] ≤ ε.\n\nIn particular, by taking η = Θ(1) and T = Θ(1/λ), the algorithm attains an ε-accurate solution after O(log(1/ε)/λ) stochastic iterations of without-replacement sampling, and O(log(1/ε)) sequential passes over the data to compute gradients of F(·). This implies that as long as 1/λ (which stands for the condition number of the problem) is smaller than O(m/log(1/ε)), the number of without-replacement stochastic iterations is smaller than the data size m. Thus, assuming the data is randomly shuffled, we can get a solution using only sequential data passes, without the need to reshuffle.\nIn terms of the log factors, we note that the condition max_t F(w_t) − F(w*) ≤ B with probability 1 is needed for technical reasons in our analysis, and we conjecture that it can be improved. However, since B appears only inside log factors, even a crude bound would suffice.
In Appendix C, we indeed show that there is always a valid B satisfying log(B) = O(log(1/ε) log(T) + log(1/λ)). Regarding the logarithmic dependence on the dimension d, it is due to an application of a matrix Bernstein inequality for d × d matrices, and can possibly be improved.

5.1 Application of Without-Replacement SVRG to distributed learning

An important variant of the problems we discussed so far is when the training data (or equivalently, the individual loss functions f_1(·), ..., f_m(·)) is split between different machines, which need to communicate in order to reach a good solution. This naturally models situations where the data is too large to fit on a single machine, or where we wish to speed up the optimization process by parallelizing it on several computers. Over the past few years, there has been much research on this question in the machine learning community, with just a few examples including [24, 2, 1, 5, 4, 10, 20, 19, 25, 13]. A substantial number of these works focus on the setting where the data is split equally at random between k machines (or equivalently, data is assigned to each machine by sampling without replacement from {f_1(·), ..., f_m(·)}). Intuitively, this creates statistical similarities between the data at different machines, which can be leveraged to aid the optimization process. Recently, Lee et al. [13] proposed a simple and elegant approach, which applies at least in certain parameter regimes. It is based on the observation that SVRG, according to its with-replacement analysis, requires O(log(1/ε)) epochs, where in each epoch one needs to compute an exact gradient of the objective function F(·) = (1/m) ∑_{i=1}^m f_i(·), and O(1/λ) gradients of individual losses f_i(·) chosen uniformly at random (where ε is the required suboptimality, and λ is the strong convexity parameter of F(·)).
Therefore, if each machine had i.i.d. samples from {f_1(·), ..., f_m(·)} whose union covers {f_1(·), ..., f_m(·)}, the machines could just simulate SVRG: First, each machine splits its data into batches of size O(1/λ). Then, each SVRG epoch is simulated by the machines computing a gradient of F(·) = (1/m) ∑_{i=1}^m f_i(·) – which can be fully parallelized and involves one communication round (assuming a broadcast communication model) – and one machine computing gradients with respect to one of its batches. Overall, this would require O(log(1/ε)) communication rounds, and O((m/k + 1/λ) log(1/ε)) runtime, where m/k is the number of data points per machine (ignoring communication time, and assuming constant time to compute a gradient of f_i(·)). Under the reasonable assumption that the strong convexity parameter λ is at least k/m, this requires runtime linear in the data size per machine, and logarithmic in the target accuracy ε, with only a logarithmic number of communication rounds. Up to log factors, this is essentially the best one can hope for with a distributed algorithm. Moreover, a lower bound in [13] indicates that at least in the worst case, O(log(1/ε)) communication rounds is impossible to obtain if λ is significantly smaller than k/m.

Unfortunately, the reasoning above crucially relies on each machine having access to i.i.d. samples, which can be reasonable in some cases, but is different from the more common assumption that the data is randomly split among the machines. To circumvent this issue, [13] propose to communicate individual data points / losses f_i(·) between machines, so as to simulate i.i.d. sampling. However, by the birthday paradox, this only works well in the regime where the overall number of samples required, O((1/λ) log(1/ε)), is not much larger than √m.
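To see where the √m threshold comes from, recall the birthday paradox: when n indices are drawn i.i.d. uniformly from {1, ..., m}, repeated indices appear with constant probability once n is on the order of √m, which is exactly when an i.i.d. sample can no longer be read off a random partition without moving data between machines. A small self-contained computation (illustrative, not from the paper):

```python
import math

def collision_probability(n, m):
    """P(at least one repeated index among n i.i.d. uniform draws
    from {1, ..., m}), via the exact birthday-problem product."""
    if n > m:
        return 1.0
    # log of P(all n draws distinct) = sum_i log(1 - i/m), computed stably
    log_p_distinct = sum(math.log1p(-i / m) for i in range(n))
    return 1.0 - math.exp(log_p_distinct)

m = 1_000_000
for n in (100, 1_000, 10_000):   # below, around, and above sqrt(m) = 1000
    print(n, round(collision_probability(n, m), 3))
```

For m = 10^6, the collision probability is negligible at n = 100, already around 0.4 at n = √m = 1000, and essentially 1 at n = 10^4.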
Otherwise, with-replacement and without-replacement sampling become statistically very different, and a large number of data points would need to be communicated. In other words, if communication is an expensive resource, then the solution proposed in [13] only works well when λ is at least order of 1/√m. In machine learning applications, the strong convexity parameter λ often comes from explicit regularization designed to prevent over-fitting, and needs to scale with the data size, usually between 1/√m and 1/m. Thus, the solution above is communication-efficient only when λ is relatively large.

However, the situation immediately improves if we can use a without-replacement version of SVRG, which can easily be simulated with randomly partitioned data: The stochastic batches can now simply be subsets of each machine's data, which are statistically identical to sampling {f_1(·), ..., f_m(·)} without replacement. Thus, no data points need to be sent across machines, even if λ is small. For clarity, we present an explicit pseudocode as Algorithm 2 in Appendix D.

Let us consider the analysis of without-replacement SVRG provided in Thm. 4. According to this analysis, by setting T = Θ(1/λ), then as long as the total number of batches is at least Ω(log(1/ε)), and λ = Ω̃(1/m), the algorithm will attain an ε-suboptimal solution in expectation. In other words, without any additional communication, we extend the applicability of distributed SVRG (at least for regularized least squares) from the λ = Ω̃(1/√m) regime to λ = Ω̃(1/m).

We emphasize that this formal analysis only applies to the regularized squared loss, which is the scope of Thm. 4.
However, this algorithmic approach can be applied to any loss function, and we conjecture that it will have similar performance for any smooth losses and strongly-convex objectives.

Acknowledgments: This research is supported in part by an FP7 Marie Curie CIG grant, ISF grant 425/13, and the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.

[2] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, 2012.

[3] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. Springer, 2012.

[4] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, 2011.

[5] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

[6] R. El-Yaniv and D. Pechyony. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35:193–234, 2009.

[7] D. Gross and V. Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.

[8] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.

[9] E. Hazan. Introduction to online convex optimization. Book draft, 2015.

[10] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. Jordan. Communication-efficient distributed dual coordinate ascent. In NIPS, 2014.

[11] R. Johnson and T. Zhang.
Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

[12] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

[13] J. Lee, T. Ma, and Q. Lin. Distributed stochastic variance reduced gradient methods. arXiv preprint arXiv:1507.07595, 2015.

[14] A. Nedić and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 223–264. Springer, 2001.

[15] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.

[16] B. Recht and C. Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In COLT, 2012.

[17] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

[18] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[19] O. Shamir and N. Srebro. On distributed stochastic optimization and learning. In Allerton, 2014.

[20] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.

[21] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, 2013.

[22] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[23] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

[24] Y. Zhang, J. Duchi, and M. Wainwright.
Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.

[25] Y. Zhang and L. Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. In ICML, 2015.