{"title": "Distributed Stochastic Optimization via Adaptive SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 1910, "page_last": 1919, "abstract": "Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial method that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method by combining adaptivity with variance reduction techniques. Our analysis yields a linear speedup in the number of machines, constant memory footprint, and only a logarithmic number of communication rounds. Critically, our approach is a black-box reduction that parallelizes any serial online learning algorithm, streamlining prior analysis and allowing us to leverage the significant progress that has been made in designing adaptive algorithms. In particular, we achieve optimal convergence rates without any prior knowledge of smoothness parameters, yielding a more robust algorithm that reduces the need for hyperparameter tuning. We implement our algorithm in the Spark distributed framework and exhibit dramatic performance gains on large-scale logistic regression problems.", "full_text": "Distributed Stochastic Optimization via Adaptive\n\nSGD\n\nAshok Cutkosky\n\nStanford University, USA\u21e4\ncutkosky@google.com\n\nR\u00f3bert Busa-Fekete\n\nYahoo! Research, New York, USA\n\nbusafekete@oath.com\n\nAbstract\n\nStochastic convex optimization algorithms are the most popular way to train ma-\nchine learning models on large-scale data. Scaling up the training process of these\nmodels is crucial, but the most popular algorithm, Stochastic Gradient Descent\n(SGD), is a serial method that is surprisingly hard to parallelize. 
In this paper, we propose an efficient distributed stochastic optimization method by combining adaptivity with variance reduction techniques. Our analysis yields a linear speedup in the number of machines, constant memory footprint, and only a logarithmic number of communication rounds. Critically, our approach is a black-box reduction that parallelizes any serial online learning algorithm, streamlining prior analysis and allowing us to leverage the significant progress that has been made in designing adaptive algorithms. In particular, we achieve optimal convergence rates without any prior knowledge of smoothness parameters, yielding a more robust algorithm that reduces the need for hyperparameter tuning. We implement our algorithm in the Spark distributed framework and exhibit dramatic performance gains on large-scale logistic regression problems.

1 Setup

We consider a fundamental problem in machine learning, stochastic convex optimization:

$$\min_{w \in W} F(w) := \mathbb{E}_{f \sim \mathcal{D}}[f(w)] \qquad (1)$$

Here, $W$ is a convex subset of $\mathbb{R}^d$ and $\mathcal{D}$ is a distribution over $L$-smooth convex functions $W \to \mathbb{R}$. We do not have direct access to $F$, and the distribution $\mathcal{D}$ is unknown, but we do have the ability to generate i.i.d. samples $f \sim \mathcal{D}$ through some kind of stream or oracle. In practice, each function $f \sim \mathcal{D}$ corresponds to a new datapoint in some learning problem. Algorithms for this problem are widely applicable: for example, in logistic regression the goal is to optimize $F(w) = \mathbb{E}[f(w)] = \mathbb{E}[\log(1 + \exp(-y w^\top x))]$, where the $(x, y)$ pairs are the (feature vector, label) pairs coming from a fixed data distribution. Given a budget of $N$ oracle calls (e.g. a dataset of size $N$), we wish to find a $\hat w$ such that $F(\hat w) - F(w^\star)$ (called the suboptimality) is as small as possible, as fast as possible, using as little memory as possible, where $w^\star$
2 argmin F .\nThe most popular algorithm for solving (1) is Stochastic Gradient Descent (SGD), which achieves\nstatistically optimal O(1/pN ) suboptimality in O(N ) time and constant memory. However, in\nmodern large-scale machine learning problems the number of data points N is often gigantic, and so\neven the linear time-complexity of SGD becomes onerous. We need a parallel algorithm that runs in\nonly O(N/m) time using m machines. We address this problem in this paper, evaluating solutions on\nthree metrics: time complexity, space complexity, and communication complexity. Time complexity\nis the total time taken to process the data points. Space complexity is the amount of space required per\n\n\u21e4now at Google\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fmachine. Note that in our streaming model, an algorithm that keeps only the most recently seen data\npoint in memory is considered to run in constant memory. Communication complexity is measured\nin terms of the number of \u201crounds\u201d of communication in which all the machines synchronize. In\nmeasuring these quantities we often suppress all constants other than those depending on N and m\nand all logarithmic factors.\nIn this paper we achieve the ideal parallelization complexity (up to a logarithmic factor) of \u02dcO(N/m)\ntime, O(1) space and \u02dcO(1) rounds of communication, so long as m < pN. Further, in contrast to\nmuch prior work, our algorithm is a reduction that enables us to generically parallelize any serial\nonline learning algorithm that obtains a suf\ufb01ciently adaptive convergence guarantee (e.g. [8, 14, 6])\nin a black-box way. This signi\ufb01cantly simpli\ufb01es our analysis by decoupling the learning rates or other\ninternal variables of the serial algorithm from the parallelization procedure. 
This technique allows our algorithm to adapt to an unknown smoothness parameter $L$ in the problem, resulting in optimal convergence guarantees without requiring tuning of learning rates. This is an important aspect of the algorithm: even prior analyses that meet the same time, space and communication costs [9, 12, 13] require the user to input the smoothness parameter to tune a learning rate. Incorrect values for this parameter can result in failure to converge, not just slower convergence. In contrast, our algorithm automatically adapts to the true value of $L$ with no tuning. Empirically, we find that the parallelized implementation of a serial algorithm matches the performance of the serial algorithm in terms of sample-complexity, while bestowing significant runtime savings.

2 Prior Work

One popular strategy for parallelized stochastic optimization is minibatch-SGD [7], in which one computes $m$ gradients at a fixed point in parallel and then averages these gradients to produce a single SGD step. When $m$ is not too large compared to the variance in $\mathcal{D}$, this procedure gives a linear speedup in theory and uses constant memory. Unfortunately, minibatch-SGD obtains a communication complexity that scales as $\sqrt{N}$ (or $N^{1/4}$ for accelerated variants). In modern problems when $N$ is extremely large, this overhead is prohibitively large. We achieve a communication complexity that is logarithmic in $N$, allowing our algorithm to be run as a near-constant number of map-reduce jobs even for very large $N$. We summarize the state of the art for some prior algorithms in Table 1.

Many prior approaches to reducing communication complexity can be broadly categorized into those that rely on Newton's method and those that rely on the variance-reduction techniques introduced in the SVRG algorithm [11].
Algorithms that use Newton\u2019s method typically make the assumption\nthat D is a distribution over quadratic losses [20, 22, 16, 21], and leverage the fact that the expected\nHessian is constant to compute a Newton step in parallel. Although quadratic losses are an excellent\nstarting point, it is not clear how to generalize these approaches to arbitrary non-quadratic smooth\nlosses such as encountered in logistic regression.\nAlternative strategies stemming from SVRG work by alternating between a \u201cbatch phase\u201d in which\none computes a very accurate gradient estimate using a large batch of examples and an \u201cSGD phase\u201d\nin which one runs SGD, using the batch gradient to reduce the variance in the updates [9, 12, 18, 10].\nOur approach also follows this overall strategy (see Section 3 for a more detailed discussion of this\nprocedure). However, all prior algorithms in this category make use of carefully speci\ufb01ed learning\nrates in the SGD phase, while our approach makes use of any adaptive serial optimization algorithm,\neven ones that do not resemble SGD at all, such as [6, 14]. This results in a streamlined analysis and\na more general \ufb01nal algorithm. Not only do we recover prior results, we can leverage the adaptivity\nof our base algorithm to obtain better results on sparse losses and to avoid any dependencies on the\nsmoothness parameter L, resulting in a much more robust procedure.\nThe rest of this paper is organized as follows. In Section 3 we provide a high-level overview of our\nstrategy. In Section 4 we introduce some basic facts about the analysis of adaptive algorithms using\nonline learning, in Section 5 we sketch our intuition for combining SVRG and the online learning\nanalysis, and in Section 6 we describe and analyze our algorithm. In Section 7 we show that the\nconvergence rate is statistically optimal and show that a parallelized implementation achieves the\nstated complexities. 
Finally, in Section 9 we give some experimental results.

Table 1: Comparison of distributed optimization algorithms with a dataset of size $N$ and $m$ machines. Logarithmic factors and all constants not depending on $N$ or $m$ have been dropped.

Method                             Quadratic Loss   Space   Communication   Adapts to L
Newton inspired [20, 22, 16, 21]   Needed           N/m     1               No
accel. minibatch-SGD [5]           Not Needed       1       N^{1/4}         No
prior SVRG-like [9, 12, 18, 10]    Not Needed       1       1               No
This work                          Not Needed       1       1               Yes

3 Overview of Approach

Our overall strategy for parallelizing a serial SGD algorithm is based upon the stochastic variance-reduced gradient (SVRG) algorithm [11]. SVRG is a technique for improving the sample complexity of SGD given access to a stream of i.i.d. samples $f \sim \mathcal{D}$ (as in our setting), as well as the ability to compute exact gradients $\nabla F(v)$ in a potentially expensive operation. The basic intuition is to use an exact gradient $\nabla F(v)$ at some "anchor point" $v \in W$ as a kind of "hint" for what the exact gradient is at nearby points $w$. Specifically, SVRG leverages the theorem that $\nabla f(w) - \nabla f(v) + \nabla F(v)$ is an unbiased estimate of $\nabla F(w)$ with variance approximately bounded by $L(F(v) - F(w^\star))$ (see (8) in [11]). Using this fact, the SVRG strategy is:

1. Choose an "anchor point" $v = w_0$.
2. Compute an exact gradient $\nabla F(v)$ (this is an expensive operation).
3. Perform $T$ SGD updates: $w_{t+1} = w_t - \eta(\nabla f(w_t) - \nabla f(v) + \nabla F(v))$ for $T$ i.i.d. samples $f \sim \mathcal{D}$, using the fixed anchor $v$.
4. Choose a new anchor point $v$ by averaging the $T$ SGD iterates, set $w_0 = v$ and repeat 2-4.

By reducing the suboptimality of the anchor point $v$, the variance in the gradients also decreases, producing a virtuous cycle in which optimization progress reduces noise, which allows faster optimization progress. This approach has two drawbacks that we will address.
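Steps 1-4 above can be sketched in a few lines of serial code. The following is a minimal illustration (all names hypothetical): since we only have stream access to $\mathcal{D}$, the "exact" anchor gradient of step 2 is approximated here by a large-batch average.

```python
import numpy as np

def svrg(grad_f, sample, w0, eta=0.1, T=100, epochs=10, batch=1000):
    """Minimal serial SVRG sketch following steps 1-4.

    grad_f(w, z): gradient of the sampled loss f(., z) at w.
    sample():     draws one i.i.d. sample z ~ D.
    The exact anchor gradient of step 2 is approximated by a
    large-batch average, since we only have stream access to D.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        v = w.copy()                          # step 1: anchor point
        # step 2: (approximate) anchor gradient
        gF = np.mean([grad_f(v, sample()) for _ in range(batch)], axis=0)
        iterates = []
        for _ in range(T):                    # step 3: T variance-reduced SGD steps
            z = sample()
            g = grad_f(w, z) - grad_f(v, z) + gF   # unbiased, low variance near v
            w = w - eta * g
            iterates.append(w.copy())
        w = np.mean(iterates, axis=0)         # step 4: new anchor = average iterate
    return w
```

For instance, with $f(w) = \frac{1}{2}(w - z)^2$ and $z \sim N(1, 1)$, so that `grad_f = lambda w, z: w - z`, the iterates approach the minimizer $w^\star = 1$.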
First, it requires computing the exact gradient $\nabla F(v)$, which is impossible in our stochastic optimization setup. Second, prior analyses require specific settings for $\eta$ that incorporate $L$ and fail to converge with incorrect settings, requiring the user to manually tune $\eta$ to obtain the desired performance. To deal with the first issue, we can approximate $\nabla F(v)$ by averaging gradients over a mini-batch, which allows us to approximate SVRG's variance-reduced gradient estimate, similar to [9, 12]. This requires us to keep track of the errors introduced by this approximation. To deal with the second issue, we incorporate analysis techniques from online learning which allow us to replace the constant step-size SGD with any adaptive stochastic optimization algorithm in a black-box manner. This second step forms the core of our theoretical contribution, as it both simplifies analysis and allows us to adapt to $L$.

The overall roadmap for our analysis has five steps:

1. We model the errors introduced by approximating the anchor-point gradient $\nabla F(v)$ by a minibatch-average as a "bias", so that we think of our algorithm as operating on slightly biased but low-variance gradient estimates.

2. Focusing first only on the bias aspect, we analyze the performance of online learning algorithms with biased gradient estimates and show that so long as the bias is sufficiently small, the algorithm will still converge quickly (Section 4).

3. Next, focusing on the variance-reduction aspect, we show that any online learning algorithm which enjoys a sufficiently adaptive convergence guarantee produces a similar "virtuous cycle" as observed with constant-step-size SGD in the analysis of SVRG, resulting in fast convergence (sketched in Section 5, proved in Appendices C and D).

4.
Combine the previous three steps to show that applying SVRG using these approximate variance-reduced gradients and an adaptive serial SGD algorithm achieves $O(L/\sqrt{N})$ suboptimality using only $O(\sqrt{N})$ serial SGD updates (Sections 6 and 7).

5. Observe that the batch processing in step 3 can be done in parallel, that this step consumes the vast majority of the computation, and that it only needs to be repeated logarithmically many times (Section 7).

4 Biased Online Learning

A popular way to analyze stochastic gradient descent and related algorithms is through online learning [19]. In this framework, an algorithm repeatedly outputs vectors $w_t$ for $t = 1, 2, \dots$ in some convex space $W$, and receives gradients $g_t$ such that $\mathbb{E}[g_t] = \nabla F(w_t)$ for some convex objective function $F$.² Typically one attempts to bound the linearized regret:

$$R_T(w^\star) = \sum_{t=1}^T g_t \cdot (w_t - w^\star)$$

where $w^\star = \operatorname{argmin} F$. We can apply online learning algorithms to stochastic optimization via online-to-batch conversion [3], which tells us that

$$\mathbb{E}[F(\bar w) - F(w^\star)] \le \frac{\mathbb{E}[R_T(w^\star)]}{T}$$

where $\bar w = \frac{1}{T}\sum_{t=1}^T w_t$. Thus, an algorithm that guarantees small regret immediately guarantees convergence in stochastic optimization. Online learning algorithms typically obtain some sort of (deterministic!) guarantee like

$$R_T(w^\star) \le R(w^\star, \|g_1\|, \dots, \|g_T\|)$$

where $R$ is increasing in each $\|g_t\|$. For example, when the convex space $W$ has diameter $D$, AdaGrad [8] obtains $R_T(w^\star) \le D\sqrt{2\sum_{t=1}^T \|g_t\|^2}$.

²The online learning literature often allows for adversarially generated $g_t$, but we consider only stochastically generated $g_t$ here.

As foreshadowed in Section 3, we will need to consider the case of biased gradients. That is, $\mathbb{E}[g_t] = \nabla F(w_t) + b_t$ for some unknown bias vector $b_t$. Given these biased gradients, a natural question is: to what extent does controlling the regret $R_T(w^\star) = \sum_{t=1}^T g_t \cdot (w_t - w^\star)$ affect our ability to control the suboptimality $\mathbb{E}[\sum_{t=1}^T F(w_t) - F(w^\star)]$? We answer this question with the following simple result:

Proposition 1. Define $R_T(w^\star) = \sum_{t=1}^T g_t \cdot (w_t - w^\star)$ and $\bar w = \frac{1}{T}\sum_{t=1}^T w_t$, where $\mathbb{E}[g_t] = \nabla F(w_t) + b_t$. Then

$$\mathbb{E}[F(\bar w) - F(w^\star)] \le \frac{\mathbb{E}[R_T(w^\star)]}{T} + \frac{1}{T}\sum_{t=1}^T \mathbb{E}\left[\|b_t\|\,\|w_t - w^\star\|\right]$$

If the domain $W$ has diameter $D$, then $\mathbb{E}[F(\bar w) - F(w^\star)] \le \frac{\mathbb{E}[R_T(w^\star)]}{T} + \frac{D}{T}\sum_{t=1}^T \mathbb{E}[\|b_t\|]$.

Our main convergence results will require algorithms with regret bounds of the form $R_T(w^\star) \le \psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}$ or $R_T(w^\star) \le \psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|}$ for various $\psi$. This is an acceptable restriction because there are many examples of such algorithms, including AdaGrad [8], SOLO [15], PiSTOL [14] and FreeRex [6]. Further, in Proposition 3 we show a simple trick to remove the dependence on $\|w_t - w^\star\|$, allowing our results to extend to unbounded domains.

5 Variance-Reduced Online Learning

In this section we sketch an argument that using variance reduction in conjunction with an online learning algorithm guaranteeing regret $R_T(w^\star) \le \psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}$ results in a very fast convergence of $\sum_{t=1}^T \mathbb{E}[F(w_t) - F(w^\star)] = O(1)$ up to log factors. A similar result holds for regret guarantees like $R_T(w^\star) \le \psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|}$ via a similar argument, which we leave to Appendix D. To do this we make use of a critical lemma of variance reduction, which asserts that a variance-reduced gradient estimate $g_t$ of $\nabla F(w_t)$ with anchor point $v_t$ has $\mathbb{E}[\|g_t\|^2] \le L(F(w_t) + F(v_t) - 2F(w^\star))$ up to constants. This gives us the following informal result:

Proposition 2. [Informal statement of Proposition 8] Given a point $w_t \in W$, let $g_t$ be an unbiased estimate of $\nabla F(w_t)$ such that $\mathbb{E}[\|g_t\|^2] \le L(F(w_t) + F(v_t) - 2F(w^\star))$. Suppose $w_1, \dots, w_T$ are generated by an online learning algorithm with regret at most $R_T(w^\star) \le \psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}$. Then

$$\mathbb{E}\left[\sum_{t=1}^T F(w_t) - F(w^\star)\right] = O\left(L\psi(w^\star)^2 + \psi(w^\star)\sqrt{\sum_{t=1}^T L\,\mathbb{E}[F(v_t) - F(w^\star)]}\right) \qquad (2)$$

Proof. The proof is remarkably simple, and we sketch it in one line here.
The full statement and proof can be found in Appendix D.

$$\mathbb{E}\left[\sum_{t=1}^T F(w_t) - F(w^\star)\right] \le \psi(w^\star)\,\mathbb{E}\left[\sqrt{\sum_{t=1}^T \|g_t\|^2}\right] \le \psi(w^\star)\sqrt{\sum_{t=1}^T L\,\mathbb{E}\left[F(w_t) - F(w^\star) + F(v_t) - F(w^\star)\right]}$$

Now square both sides and use the quadratic formula to solve for $\mathbb{E}\left[\sum_{t=1}^T F(w_t) - F(w^\star)\right]$.

Notice that in Proposition 2, the online learning algorithm's regret guarantee $\psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}$ does not involve the smoothness parameter $L$, and yet nevertheless $L$ shows up in equation (2). It is this property that will allow us to adapt to $L$ without requiring any user-supplied information.

Algorithm 1 SVRG OL (SVRG with Online Learning)
1: Initialize: online learning algorithm $A$; batch size $\hat N$; epoch lengths $0 = T_0, \dots, T_K$; set $T_{a:b} = \sum_{i=a}^{b} T_i$.
2: Get initial vector $w_1$ from $A$, set $v_1 \leftarrow w_1$.
3: for $k = 1$ to $K$ do
4:   Sample $\hat N$ functions $f_1, \dots, f_{\hat N} \sim \mathcal{D}$.
5:   $\nabla \hat F(v_k) \leftarrow \frac{1}{\hat N}\sum_{i=1}^{\hat N} \nabla f_i(v_k)$ (this step can be done in parallel).
6:   for $t = T_{0:k-1} + 1$ to $T_{0:k}$ do
7:     Sample $f \sim \mathcal{D}$.
8:     Give $g_t = \nabla f(w_t) - \nabla f(v_k) + \nabla \hat F(v_k)$ to $A$.
9:     Get updated vector $w_{t+1}$ from $A$.
10:  end for
11:  $v_{k+1} \leftarrow \frac{1}{T_k}\sum_{t=T_{0:k-1}+1}^{T_{0:k}} w_t$
12: end for

Variance reduction allows us to generate estimates $g_t$ satisfying the hypothesis of Proposition 2, so that we can control our convergence rate by picking appropriate $v_t$s. We want to change $v_t$ very few times, because changing anchor points requires us to compute a high-accuracy estimate of $\nabla F(v_t)$. Thus we change $v_t$ only when $t$ is a power of 2, and set $v_{2^n}$ to be the average of the last $2^{n-1}$ iterates $w_t$. By Jensen's inequality, this allows us to bound $\sum_{t=1}^T \mathbb{E}[F(v_t) - F(w^\star)]$ by $\sum_{t=1}^T \mathbb{E}[F(w_t) - F(w^\star)]$, and so applying Proposition 2 we can conclude $\sum_{t=1}^T \mathbb{E}[F(w_t) - F(w^\star)] = O(1)$.

6 SVRG with Online Learning

With the machinery of the previous sections, we are now in a position to derive and analyze our main algorithm, presented in SVRG OL.

SVRG OL implements the procedure described in Section 3. For each of a series of $K$ rounds, we compute a batch gradient estimate $\nabla \hat F(v_k)$ for some "anchor point" $v_k$. Then we run $T_k$ iterations of an online learning algorithm. To compute the $t$th gradient $g_t$ given to the online learning algorithm in response to an output point $w_t$, SVRG OL approximates the variance-reduction trick of SVRG, setting $g_t = \nabla f(w_t) - \nabla f(v_k) + \nabla \hat F(v_k)$ for some new sample $f \sim \mathcal{D}$. After the $T_k$ iterations have elapsed, a new anchor point $v_{k+1}$ is chosen and the process repeats.

In this section we characterize SVRG OL's performance when the base algorithm $A$ has a regret guarantee of $\psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}$. We can also perform essentially similar analysis for regret guarantees like $\psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|}$, but we postpone this to Appendix E.

In order to analyze SVRG OL, we need to bound the error $\|\nabla \hat F(v_k) - \nabla F(v_k)\|$ uniformly for all $k \le K$. This can be accomplished through an application of Hoeffding's inequality:

Lemma 1. Suppose that $\mathcal{D}$ is a distribution over $G$-Lipschitz functions. Then with probability at least $1 - \delta$, $\max_k \|\nabla \hat F(v_k) - \nabla F(v_k)\| \le \sqrt{\frac{2G^2\log(K/\delta) + G^2}{\hat N}}$.

The proof of Lemma 1 is deferred to Appendix A. The following theorem is now an immediate consequence of the concentration bound in Lemma 1 and Propositions 8 and 9 (see Appendix).

Theorem 1. Suppose the online learning algorithm $A$ guarantees regret $R_T(w^\star) \le \psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}$. Set $b_t = \|\nabla \hat F(v_k) - \nabla F(v_k)\|$ for $t \in [T_{0:k-1} + 1, T_{0:k}]$ (where $T_{a:b} := \sum_{i=a}^b T_i$). Suppose that $T_k/T_{k-1} \le \rho$ for all $k$.
Then for $\bar w = \frac{1}{T}\sum_{t=1}^T w_t$,

$$\mathbb{E}[F(\bar w) - F(w^\star)] \le \frac{32(1+\rho)\psi(w^\star)^2 L}{T} + \frac{2\psi(w^\star)\sqrt{8 L T_1\,\mathbb{E}[F(v_1) - F(w^\star)] + 2\sum_{t=1}^T \mathbb{E}[\|b_t\|^2]}}{T} + \frac{2\sum_{t=1}^T \mathbb{E}\left[\|b_t\|\,\|w_t - w^\star\|\right]}{T}.$$

In particular, if $\mathcal{D}$ is a distribution over $G$-Lipschitz functions, then with probability at least $1 - \frac{1}{T}$ we have $\|b_t\| \le \sqrt{\frac{2G^2\log(KT) + G^2}{\hat N}}$ for all $t$. Further, if $\hat N > T^2$ and $W$ has diameter $D$, then this implies

$$\mathbb{E}[F(\bar w) - F(w^\star)] \le \frac{32(1+\rho)\psi(w^\star)^2 L}{T} + \frac{4\psi(w^\star)\sqrt{\log(KT)}}{T\sqrt{T}} + \frac{8\psi(w^\star)\sqrt{L T_1\,\mathbb{E}[F(v_1) - F(w^\star)]}}{T} + \frac{GD}{T} + \frac{2G\sqrt{2\log(KT) + 1}\,D}{T} = O\left(\frac{\sqrt{\log(KT)}}{T}\right)$$

We note that although this theorem requires a finite diameter for the second result, we present a simple technique to deal with unbounded domains and retain the same result in Appendix D.

7 Statistical and Computational Complexity

In this section we describe how to choose the batch size $\hat N$ and epoch sizes $T_k$ in order to obtain optimal statistical complexity and computational complexity. The total amount of data used by SVRG OL is $N = K\hat N + T_{0:K} = K\hat N + T$. If we choose $\hat N = T^2$, this is $O(K\hat N)$. Set $T_k = 2T_{k-1}$ with some $T_1 > 0$, so that $\rho = \max T_k/T_{k-1} = 2$ and $K = O(\log N)$. Then Theorem 1 guarantees suboptimality $O(\sqrt{\log(TK)}/T)$, which is $O(\sqrt{K\log(TK)}/\sqrt{K\hat N}) = O(\sqrt{K\log(TK)}/\sqrt{N})$. This matches the optimal $O(1/\sqrt{N})$ up to logarithmic factors and constants.

The parallelization step is simple: we parallelize the computation of $\nabla \hat F(v_k)$ by having $m$ machines compute and sum gradients for $\hat N/m$ new examples each, and then averaging these $m$ sums together on one machine. Notice that this can be done with a constant memory footprint by streaming the examples in: the algorithm will not make any further use of these examples, so it is safe to forget them. Then we run the $T_k$ steps of the inner loop in serial, which again can be done in a constant memory footprint.
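The parallel batch-gradient phase just described can be sketched as a single map-reduce round. Below is a minimal sketch (names hypothetical; threads stand in for the $m$ machines): each worker streams its share of the $\hat N$ fresh examples while keeping only a running gradient sum, and the driver averages the $m$ partial sums.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def batch_gradient(grad_f, v, samples, m=4):
    """Approximate the anchor gradient at v by averaging grad_f over `samples`,
    split across m workers in one map-reduce round.  Each worker keeps only a
    running sum over its streamed chunk: constant memory per machine."""
    chunks = np.array_split(np.asarray(samples), m)

    def partial_sum(chunk):
        s = np.zeros_like(v, dtype=float)
        for z in chunk:          # stream the chunk; nothing is retained after use
            s += grad_f(v, z)
        return s

    with ThreadPoolExecutor(max_workers=m) as pool:
        sums = list(pool.map(partial_sum, chunks))   # "map" step on m workers
    return sum(sums) / len(samples)                  # driver-side "reduce"
```

In SVRG OL this average plays the role of $\nabla \hat F(v_k)$; only the $m$ partial sums are communicated, which is what keeps the communication cost to one round per anchor point.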
This results in a total runtime of $O(K\hat N/m + T)$: a linear speedup so long as $m < KN/T$. For algorithms with regret bounds matching the conditions of Theorem 1, we get optimal convergence rates by setting $\hat N = T^2$, in which case our total data usage is $N = O(K\hat N)$. This yields the following calculation:

Theorem 2. Set $T_k = 2T_{k-1}$. Suppose the base optimizer $A$ in SVRG OL guarantees regret $R_T(w^\star) \le \psi(w^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}$, and the domain $W$ has finite diameter $D$. Let $\hat N = \Theta(T^2)$ and $N = K\hat N + T$ be the total number of data points observed. Suppose we compute the batch gradients $\nabla \hat F(v_k)$ in parallel on $m$ machines with $m < \sqrt{N}$. Then for $\bar w = \frac{1}{T}\sum_{t=1}^T w_t$ we obtain

$$\mathbb{E}[F(\bar w) - F(w^\star)] = \tilde O\left(\frac{1}{\sqrt{N}}\right)$$

in time $\tilde O(N/m)$, space $O(1)$, and $K = \tilde O(1)$ communication rounds.

8 Implementation

8.1 Linear Losses and Dense Batch Gradients

Many losses of practical interest take the form $f(w) = \ell(w \cdot x, y)$ for some label $y$ and feature vector $x \in \mathbb{R}^d$, where $d$ is extremely large but $x$ is $s$-sparse. These losses have the property that $\nabla f(w) = \ell'(w \cdot x, y)x$ is also $s$-sparse. Since $d$ can often be very large, it is extremely desirable to perform all parameter updates in $O(s)$ time rather than $O(d)$ time. This is relatively easy to accomplish for most SGD algorithms, but our strategy involves correcting the variance in $\nabla f(w)$ using a dense batch gradient $\nabla \hat F(v_k)$, and so we are in danger of losing the significant computational speedup that comes from taking advantage of sparsity. We address this problem through an importance-sampling scheme.

Suppose the $i$th coordinate of $x$ is non-zero with probability $p_i$. Given a vector $v$, let $I(v)$ be the vector whose $i$th component is $0$ if $v_i = 0$, and $1/p_i$ if $v_i \ne 0$. Then $\mathbb{E}[I(\nabla f(w))]$ is equal to the all-ones vector.
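The claim that $\mathbb{E}[I(\nabla f(w))]$ is the all-ones vector is easy to check numerically. The sketch below (with hypothetical inclusion probabilities $p_i$) estimates that expectation by Monte Carlo, where `importance_weights` plays the role of $I(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
p = np.array([0.9, 0.5, 0.1])   # hypothetical: coordinate i non-zero w.p. p_i

def importance_weights(g, p):
    """I(g): 0 where g_i == 0, else 1/p_i."""
    return np.where(g != 0, 1.0 / p, 0.0)

# Monte-Carlo check: when coordinate i of a sparse gradient is non-zero with
# probability p_i, the mean of I(gradient) approaches the all-ones vector.
draws = (rng.binomial(1, p) * rng.normal(size=d) for _ in range(100_000))
est = np.mean([importance_weights(g, p) for g in draws], axis=0)
```

Rare coordinates (small $p_i$) get large weights $1/p_i$, which is exactly the source of the extra variance discussed below.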
Using this notation, we replace the variance-reduced gradient estimate $\nabla f(w) - \nabla f(v_k) + \nabla \hat F(v_k)$ with $\nabla f(w) - \nabla f(v_k) + \nabla \hat F(v_k) \odot I(\nabla f(w))$, where $\odot$ indicates component-wise multiplication. Since $\mathbb{E}[I(\nabla f(w))]$ is the all-ones vector, $\mathbb{E}[\nabla \hat F(v_k) \odot I(\nabla f(w))] = \nabla \hat F(v_k)$, and so the expected value of this estimate has not changed. However, it is clear that the sparsity of the estimate is now equal to the sparsity of $\nabla f(w)$. Performing this transformation introduces additional variance into the estimate and could slow down our convergence by a constant factor. However, we find that even with this extra variance we still see impressive speedups (see Section 9).

8.2 Spark Implementation

Implementing our algorithm in the Spark environment is fairly straightforward. SVRG OL switches between two phases: a batch gradient computation phase and a serial phase in which we run the online learning algorithm. The serial phase is carried out by the driver while the batch gradient is computed by the executors. We initially divide the training data into C chunks of approximately 100M each, and we use min(1000, C) executors. Tree aggregation with a depth of 5 is used to gather the gradient from the executors, which is similar to the operation implemented by Vowpal Wabbit (VW) [1]. We use asynchronous collects to move the instances used in the next serial SGD phase of SVRG OL to the driver while the batch gradient is being computed. We used feature hashing with 23 bits to limit memory consumption.

8.3 Batch Sizes

Our theoretical analysis asks for exponentially increasing serial phase lengths $T_k$ and a batch size of $\hat N = T^2$. In practice we use slightly different settings. We have a constant serial phase length $T_k = T_0$ for all $k$, and an increasing batch size $\hat N_k = kC$ for some constant $C$. We usually set $C = T_0$.
The constant $T_k$ is motivated by observing that the requirement of exponentially increasing $T_k$ comes from a desire to offset potential poor performance in the first serial phase (which gives the dependence on $T_1$ in Theorem 1). In practice we do not expect this to be an issue. The increasing batch size is motivated by the empirical observation that earlier serial phases (when we are farther from the optimum) typically do not require as accurate a batch gradient in order to make fast progress.

Table 2: Statistics of the datasets. The compressed size of the data is reported. B=Billion, M=Million

Data        # Instances (Train/Test)   Data size (Gb, Train/Test)   # Features   Avg # feat   % positives
KDD10       19.2M / 0.74M              0.5 / 0.02                   29 890 095   29.34        86.06%
KDD12       119.7M / 29.9M             1.6 / 0.5                    54 686 452   11.0         4.44%
ADS SMALL   1.216B / 0.356B            155.0 / 40.3                 2 970 211    92.96        8.55%
ADS LARGE   5.613B / 1.097B            1049.1 / 486.1               12 133 899   95.72        9.42%
EMAIL       1.236B / 0.994B            637.4 / 57.6                 37 091 273   132.12       18.74%

9 Experiments

To verify our theoretical results, we carried out experiments on large-scale (order 100 million datapoints) public datasets, such as KDD10 and KDD12,³ and on proprietary data (order 1 billion datapoints), such as click-prediction for ads [4] and email click-prediction datasets [2]. The main statistics of the datasets are shown in Table 2. All of these are large datasets with sparse features, heavily imbalanced in terms of class distribution. We solved these binary classification tasks with logistic regression. We tested two well-known scalable logistic regression implementations: Spark ML 2.2.0 and Vowpal Wabbit 7.10.0 (VW).⁴ To optimize the logistic loss we used the L-BFGS algorithm implemented by both packages. We also tested minibatch SGD and non-adaptive SVRG implementations.
However, we observe that the relationship between non-adaptive SVRG updates and the updates in our algorithm is analogous to the relationship between the updates in constant-step-size SGD and (for example) AdaGrad. Since our experiments involved sparse high-dimensional data, adaptive step sizes are very important, and one should not expect these algorithms to be competitive (and indeed they were not).

Figure 1: Test loss of three SGD algorithms (PiSTOL [14], Vowpal Wabbit (labeled as SGD VW) [17] and FreeRex [6]) and SVRG OL (labeled as SVRG OL, using FreeRex as the base optimizer) on the benchmark datasets.

First we compared SVRG OL to several non-parallelized baseline SGD optimizers on the different datasets. We plot the loss as a function of the number of datapoints processed, as well as of the total runtime (Figure 1). Measuring the number of datapoints processed gives us a sense of the statistical efficiency of the algorithm and gives a metric that is independent of implementation quality details. We see that, remarkably, SVRG OL actually performs well as a function of the number of datapoints processed, and so is a competitive serial algorithm before even any parallelization. Thus it is no surprise that we see significant speedups when the batch computation is parallelized.

To assess the trend of the speed-up with the size of the training data, we plotted the relative speed-up of SVRG OL versus FreeRex, which is used as the base optimizer in SVRG OL. Figure 2 shows the fraction of running time of the non-parallel and parallel algorithms needed to achieve the same performance in terms of test loss. The x-axis scales with the running time of the parallel SVRG OL algorithm. The speed-up increases with training time, and thus with the number of training instances processed.

³https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
⁴https://github.com/JohnLangford/vowpal_wabbit/releases/tag/7.10
This result suggests that our method will indeed match the theoretical guarantees for large enough datasets, although this trend is hard to verify rigorously in our test regime.

Figure 2: The speed-up ratio of SVRG OL versus FreeRex on various datasets.

In our second experiment, we proceed to compare SVRG OL to Spark ML and VW in Table 3. These two L-BFGS-based algorithms were superior in all metrics to the minibatch SGD and non-adaptive SVRG algorithms, and so we report only the comparison to Spark ML and VW (see Section F for full results). We measure the number of communication rounds, the total training error, the error on a held-out test set, the Area Under the Curve (AUC), and total runtime in minutes. Table 3 illustrates that SVRG OL compares well to both Spark ML and VW. Notably, SVRG OL uses dramatically fewer communication rounds. On the smaller KDD datasets, we also see much faster runtimes, possibly due to overhead costs associated with the other algorithms. It is important to note that SVRG OL makes only one pass over the dataset, while the competition makes one pass per communication round, resulting in hundreds of passes. Nevertheless, we obtain competitive final error due to SVRG OL's high statistical efficiency.

Table 3: Average loss and AUC achieved by logistic regression implemented in Spark ML, VW and SVRG OL. "Com." refers to the number of communication rounds and time is measured in minutes. The results on the KDD10, ADS LARGE and EMAIL data are presented in App. F due to lack of space.

KDD12
Method     Com.   Train      Test       AUC      Time
Spark ML   100    0.15756    0.15589    75.485    36
Spark ML   550    0.15755    0.15570    75.453   180
VW         100    0.15398    0.15725    77.871    44
VW         500    0.14866    0.15550    78.881   150
SVRG OL      4    0.152740   0.154985   78.431     8

ADS SMALL
Method     Com.   Train      Test       AUC      Time
Spark ML   100    0.23372    0.22288    83.356    42
Spark ML   500    0.23365    0.22286    83.365   245
VW         100    0.23381    0.22347    83.214   114
VW         500    0.23157    0.22251    83.499   396
SVRG OL     14    0.23147    0.22244    83.479    94

10 Conclusion

We have presented SVRG OL, a generic stochastic optimization framework which combines adaptive online learning algorithms with variance reduction to obtain communication efficiency in parallel architectures. Our analysis significantly streamlines previous work by making black-box use of any adaptive online learning algorithm, thus disentangling the variance-reduction and serial phases of SVRG algorithms. We require only a logarithmic number of communication rounds, and we automatically adapt to an unknown smoothness parameter, yielding both fast performance and robustness to hyperparameter tuning. We developed a Spark implementation of SVRG OL and solved real large-scale sparse learning problems with performance competitive with the L-BFGS implementations in VW and Spark ML.

References

[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111-1133, 2014.

[2] Alina Beygelzimer, Robert Busa-Fekete, Guy Halawi, and Francesco Orabona. Algorithmic notifications for mail: Debiasing mail data to predict user actions. In under review, 2018.

[3] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

[4] Haibin Cheng and Erick Cantú-Paz. Personalized click prediction in sponsored search. In Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM 2010, New York, NY, USA, February 4-6, 2010, pages 351-360, 2010. doi: 10.1145/1718487.1718531.
URL http://doi.acm.org/10.1145/1718487.1718531.

[5] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 1647–1655, 2011.

[6] Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 643–677, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR. URL http://proceedings.mlr.press/v65/cutkosky17a.html.

[7] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[9] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 728–763, 2015.

[10] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251–2259, 2015.

[11] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013.
Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315–323, 2013.

[12] Lihua Lei and Michael I. Jordan. Less than a single pass: Stochastically controlled stochastic gradient method. arXiv preprint arXiv:1609.03261, 2016.

[13] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.

[14] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.

[15] Francesco Orabona and Dávid Pál. Scale-free online learning. arXiv preprint arXiv:1601.01974, 2016.

[16] Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

[17] Stéphane Ross, Paul Mineiro, and John Langford. Normalized online learning. arXiv preprint arXiv:1305.6646, 2013.

[18] Vatsal Shah, Megasthenis Asteris, Anastasios Kyrillidis, and Sujay Sanghavi. Trading-off variance and complexity in stochastic gradient descent. arXiv preprint arXiv:1603.06861, 2016.

[19] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[20] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate Newton-type method. CoRR, abs/1312.7853, 2013. URL http://arxiv.org/abs/1312.7853.

[21] Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. CoRR, abs/1702.06269, 2017. URL http://arxiv.org/abs/1702.06269.

[22] Yuchen Zhang and Lin Xiao.
Communication-efficient distributed optimization of self-concordant empirical loss. CoRR, abs/1501.00263, 2015. URL http://arxiv.org/abs/1501.00263.