{"title": "A Multi-Batch L-BFGS Method for Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1055, "page_last": 1063, "abstract": "The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.", "full_text": "A Multi-Batch L-BFGS Method for Machine\n\nLearning\n\nAlbert S. Berahas\n\nNorthwestern University\n\nEvanston, IL\n\nalbertberahas@u.northwestern.edu\n\nJorge Nocedal\n\nNorthwestern University\n\nEvanston, IL\n\nj-nocedal@northwestern.edu\n\nMartin Tak\u00e1\u02c7c\n\nLehigh University\n\nBethlehem, PA\n\ntakac.mt@gmail.com\n\nAbstract\n\nThe question of how to parallelize the stochastic gradient descent (SGD) method\nhas received much attention in the literature. In this paper, we focus instead on batch\nmethods that use a sizeable fraction of the training set at each iteration to facilitate\nparallelism, and that employ second-order information. In order to improve the\nlearning process, we follow a multi-batch approach in which the batch changes\nat each iteration. 
This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

1 Introduction

It is common in machine learning to encounter optimization problems involving millions of parameters and very large datasets. To deal with the computational demands imposed by such applications, high performance implementations of stochastic gradient and batch quasi-Newton methods have been developed [1, 11, 9]. In this paper we study a batch approach based on the L-BFGS method [20] that strives to reach the right balance between efficient learning and productive parallelism.
In supervised learning, one seeks to minimize empirical risk,

F(w) := (1/n) Σ_{i=1}^{n} f(w; xi, yi) def= (1/n) Σ_{i=1}^{n} fi(w),

where (xi, yi), i = 1, . . . , n, denote the training examples and f(·; x, y) : Rd → R is the composition of a prediction function (parametrized by w) and a loss function. The training problem consists of finding an optimal choice of the parameters w ∈ Rd with respect to F, i.e.,

min_{w∈Rd} F(w) = (1/n) Σ_{i=1}^{n} fi(w).    (1.1)

At present, the preferred optimization method is the stochastic gradient descent (SGD) method [23, 5], and its variants [14, 24, 12], which are implemented either in an asynchronous manner (e.g. when using a parameter server in a distributed setting) or following a synchronous mini-batch approach that exploits parallelism in the gradient evaluation [2, 22, 13].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
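For concreteness, the empirical risk objective (1.1) and its gradient can be sketched in a few lines. This is a minimal sketch assuming a binary logistic loss, the loss used in the experiments of Section 4; the helper name `empirical_risk_and_grad` is illustrative, not part of the paper.

```python
import numpy as np

def empirical_risk_and_grad(w, X, y):
    """Empirical risk F(w) = (1/n) sum_i f_i(w) and its gradient, for the
    illustrative choice f_i(w) = log(1 + exp(-y_i * x_i^T w)) with labels
    y_i in {-1, +1}. The framework in (1.1) allows any smooth composition
    of a prediction function and a loss function."""
    n = X.shape[0]
    z = y * (X @ w)                       # margins y_i * x_i^T w
    loss = np.mean(np.log1p(np.exp(-z)))  # F(w)
    # d f_i / dw = -y_i * sigmoid(-z_i) * x_i
    coef = -y / (1.0 + np.exp(z))         # shape (n,)
    grad = (X.T @ coef) / n               # (1/n) sum_i grad f_i(w)
    return loss, grad
```

At w = 0 the loss is log 2 regardless of the data, which gives a quick sanity check for such a sketch.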
A drawback of the asynchronous approach\nis that it cannot use large batches, as this would cause updates to become too dense and compromise\nthe stability and scalability of the method [16, 22]. As a result, the algorithm spends more time in\ncommunication as compared to computation. On the other hand, using a synchronous mini-batch\napproach one can achieve a near-linear decrease in the number of SGD iterations as the mini-batch\nsize is increased, up to a certain point after which the increase in computation is not offset by the\nfaster convergence [26].\nAn alternative to SGD is a batch method, such as L-BFGS, which is able to reach high training\naccuracy and allows one to perform more computation per node, so as to achieve a better balance\nwith communication costs [27]. Batch methods are, however, not as ef\ufb01cient learning algorithms as\nSGD in a sequential setting [6]. To bene\ufb01t from the strength of both methods some high performance\nsystems employ SGD at the start and later switch to a batch method [1].\nMulti-Batch Method. In this paper, we follow a different approach consisting of a single method\nthat selects a sizeable subset (batch) of the training data to compute a step, and changes this batch at\neach iteration to improve the learning abilities of the method. We call this a multi-batch approach\nto differentiate it from the mini-batch approach used in conjunction with SGD, which employs a\nvery small subset of the training data. When using large batches it is natural to employ a quasi-\nNewton method, as incorporating second-order information imposes little computational overhead\nand improves the stability and speed of the method. We focus here on the L-BFGS method, which\nemploys gradient information to update an estimate of the Hessian and computes a step in O(d) \ufb02ops,\nwhere d is the number of variables. 
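The O(d)-flop step computation mentioned above comes from the L-BFGS two-loop recursion [20], which applies the inverse Hessian approximation to a vector without ever forming a d x d matrix. A minimal sketch of the standard recursion, assuming the usual s^T y / y^T y initial scaling; the function name is illustrative.

```python
import numpy as np

def two_loop_recursion(g, s_list, y_list):
    """Compute H_k @ g from the m most recent correction pairs (s_i, y_i)
    (oldest first) in O(m d) flops. Assumes at least one stored pair and
    y_i^T s_i > 0 for every pair."""
    q = g.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * np.dot(s, q)
        alphas.append(a)          # stored newest-first
        q = q - a * y
    # Initial matrix H_0 = gamma * I with gamma = s^T y / y^T y (a common choice)
    s_m, y_m = s_list[-1], y_list[-1]
    r = (np.dot(s_m, y_m) / np.dot(y_m, y_m)) * q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r = r + (a - b) * s
    return r  # r = H_k @ g
```

With a single pair s = y, the recursion reproduces the secant condition H y = s and acts as the identity on the rest of the space.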
The multi-batch approach can, however, cause dif\ufb01culties to\nL-BFGS because this method employs gradient differences to update Hessian approximations. When\nthe gradients used in these differences are based on different data points, the updating procedure can\nbe unstable. Similar dif\ufb01culties arise in a parallel implementation of the standard L-BFGS method, if\nsome of the computational nodes devoted to the evaluation of the function and gradient are unable to\nreturn results on time \u2014 as this again amounts to using different data points to evaluate the function\nand gradient at the beginning and the end of the iteration. The goal of this paper is to show that stable\nquasi-Newton updating can be achieved in both settings without incurring extra computational cost, or\nspecial synchronization. The key is to perform quasi-Newton updating based on the overlap between\nconsecutive batches. The only restriction is that this overlap should not be too small, something that\ncan be achieved in most situations.\nContributions. We describe a novel implementation of the batch L-BFGS method that is robust in\nthe absence of sample consistency; i.e., when different samples are used to evaluate the objective\nfunction and its gradient at consecutive iterations. The numerical experiments show that the method\nproposed in this paper \u2014 which we call the multi-batch L-BFGS method \u2014 achieves a good balance\nbetween computation and communication costs. We also analyze the convergence properties of the\nnew method (using a \ufb01xed step length strategy) on both convex and nonconvex problems.\n\n2 The Multi-Batch Quasi-Newton Method\n\nIn a pure batch approach, one applies a gradient based method, such as L-BFGS [20], to the\ndeterministic optimization problem (1.1). When the number n of training examples is large, it is\nnatural to parallelize the evaluation of F and \u2207F by assigning the computation of the component\nfunctions fi to different processors. 
If this is done on a distributed platform, it is possible for some of the computational nodes to be slower than the rest. In this case, the contribution of the slow (or unresponsive) computational nodes could be ignored given the stochastic nature of the objective function. This leads, however, to an inconsistency in the objective function and gradient at the beginning and at the end of the iteration, which can be detrimental to quasi-Newton methods. Thus, we seek to find a fault-tolerant variant of the batch L-BFGS method that is capable of dealing with slow or unresponsive computational nodes.
A similar challenge arises in a multi-batch implementation of the L-BFGS method in which the entire training set T = {(xi, yi)}_{i=1}^{n} is not employed at every iteration, but rather, a subset of the data is used to compute the gradient. Specifically, we consider a method in which the dataset is randomly divided into a number of batches — say 10, 50, or 100 — and the minimization is performed with respect to a different batch at every iteration. At the k-th iteration, the algorithm chooses a batch Sk ⊂ {1, . . . , n}, computes

F^{Sk}(wk) = (1/|Sk|) Σ_{i∈Sk} fi(wk),    ∇F^{Sk}(wk) = g_k^{Sk} = (1/|Sk|) Σ_{i∈Sk} ∇fi(wk),    (2.2)

and takes a step along the direction −Hk g_k^{Sk}, where Hk is an approximation to ∇²F(wk)⁻¹. Allowing the sample Sk to change freely at every iteration gives this approach flexibility of implementation and is beneficial to the learning process, as we show in Section 4. (We refer to Sk as the sample of training points, even though Sk only indexes those points.)
The case of unresponsive computational nodes and the multi-batch method are similar. The main difference is that node failures create unpredictable changes to the samples Sk, whereas a multi-batch method has control over sample generation.
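The batch quantities in (2.2) can be sketched as follows. The per-sample oracle `loss_grad_fn`, returning (f_i(w), ∇f_i(w)), is a hypothetical interface introduced only for this sketch.

```python
import numpy as np

def batch_value_and_grad(w, X, y, S, loss_grad_fn):
    """F^{S_k}(w_k) and g_k^{S_k} as in (2.2): averages of f_i and of
    grad f_i over the index set S_k. `loss_grad_fn(w, x_i, y_i)` is an
    assumed per-sample oracle returning (f_i(w), grad f_i(w))."""
    value = 0.0
    grad = np.zeros_like(w)
    for i in S:
        v, g = loss_grad_fn(w, X[i], y[i])
        value += v
        grad += g
    return value / len(S), grad / len(S)
```

In a distributed run, each node would evaluate this on its local indices and the master would average the partial results.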
In either case, the algorithm employs a stochastic approximation to the gradient and can no longer be considered deterministic. We must, however, distinguish our setting from that of the classical SGD method, which employs small mini-batches and noisy gradient approximations. Our algorithm operates with much larger batches so that distributing the function evaluation is beneficial and the compute time of g_k^{Sk} is not overwhelmed by communication costs. This gives rise to gradients with relatively small variance and justifies the use of a second-order method such as L-BFGS.
Robust Quasi-Newton Updating. The difficulties created by the use of a different sample Sk at each iteration can be circumvented if consecutive samples Sk and Sk+1 overlap, so that Ok = Sk ∩ Sk+1 ≠ ∅. One can then perform stable quasi-Newton updating by computing gradient differences based on this overlap, i.e., by defining

s_{k+1} = w_{k+1} − w_k,    y_{k+1} = g_{k+1}^{Ok} − g_k^{Ok},    (2.3)

in the notation given in (2.2). The correction pair (yk, sk) can then be used in the BFGS update. When the overlap set Ok is not too small, yk is a useful approximation of the curvature of the objective function F along the most recent displacement, and will lead to a productive quasi-Newton step. This observation is based on an important property of Newton-like methods, namely that there is much more freedom in choosing a Hessian approximation than in computing the gradient [7, 3]. Thus, a smaller sample Ok can be employed for updating the inverse Hessian approximation Hk than for computing the batch gradient g_k^{Sk} used in the search direction −Hk g_k^{Sk}. In summary, by ensuring that unresponsive nodes do not constitute the vast majority of all working nodes in a fault-tolerant parallel implementation, or by exerting a small degree of control over the creation of the samples Sk in the multi-batch method, one can design a robust method that naturally builds upon the fundamental properties of BFGS updating.
We should mention in passing that a commonly used strategy for ensuring stability of quasi-Newton updating in machine learning is to enforce gradient consistency [25], i.e., to use the same sample Sk to compute gradient evaluations at the beginning and the end of the iteration. Another popular remedy is to use the same batch Sk for multiple iterations [19], alleviating the gradient inconsistency problem at the price of slower convergence. In this paper, we assume that achieving such sample consistency is not possible (in the fault-tolerant case) or desirable (in a multi-batch framework), and wish to design a new variant of L-BFGS that imposes minimal restrictions on the sample changes.

2.1 Specification of the Method

At the k-th iteration, the multi-batch BFGS algorithm chooses a set Sk ⊂ {1, . . . , n} and computes a new iterate

w_{k+1} = w_k − α_k Hk g_k^{Sk},    (2.4)

where α_k is the step length, g_k^{Sk} is the batch gradient (2.2) and Hk is the inverse BFGS Hessian matrix approximation that is updated at every iteration by means of the formula

H_{k+1} = V_k^T Hk V_k + ρ_k s_k s_k^T,    ρ_k = 1/(y_k^T s_k),    V_k = I − ρ_k y_k s_k^T.

To compute the correction vectors (sk, yk), we determine the overlap set Ok = Sk ∩ Sk+1 consisting of the samples that are common at the k-th and k+1-st iterations.
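The overlap-based correction pair of (2.3) can be sketched as follows. Here `grad_fn` is an assumed batch-gradient oracle over an index set, and returning `None` when the overlap is empty is an illustrative safeguard, not part of the paper's specification.

```python
import numpy as np

def overlap_curvature_pair(w_k, w_next, S_k, S_next, grad_fn):
    """Form (s_{k+1}, y_{k+1}) as in (2.3): both gradients in the
    difference are evaluated on the overlap O_k = S_k ∩ S_{k+1}, so the
    quasi-Newton update sees a consistent sample. `grad_fn(w, O)` is a
    hypothetical oracle returning the averaged gradient over indices O."""
    O = sorted(set(S_k) & set(S_next))
    if not O:
        return None  # no overlap: skip the update in this sketch
    s = w_next - w_k
    y = grad_fn(w_next, O) - grad_fn(w_k, O)
    return s, y
```

Only the two gradient evaluations on O_k are needed, so a small overlap keeps the extra cost negligible relative to the batch gradient.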
We define

F^{Ok}(wk) = (1/|Ok|) Σ_{i∈Ok} fi(wk),    ∇F^{Ok}(wk) = g_k^{Ok} = (1/|Ok|) Σ_{i∈Ok} ∇fi(wk),

and compute the correction vectors as in (2.3). In this paper we assume that αk is constant.
In the limited memory version, the matrix Hk is defined at each iteration as the result of applying m BFGS updates to a multiple of the identity matrix, using a set of m correction pairs {si, yi} kept in storage. The memory parameter m is typically in the range 2 to 20. When computing the matrix-vector product in (2.4) it is not necessary to form the matrix Hk, since one can obtain this product via the two-loop recursion [20], using the m most recent correction pairs {si, yi}. After the step has been computed, the oldest pair (sj, yj) is discarded and the new curvature pair is stored.
A pseudo-code of the proposed method is given below, and depends on several parameters. The parameter r denotes the fraction of samples in the dataset used to define the gradient, i.e., r = |S|/n. The parameter o denotes the length of overlap between consecutive samples, and is defined as a fraction of the number of samples in a given batch S, i.e., o = |O|/|S|.

Algorithm 1 Multi-Batch L-BFGS
Input: w0 (initial iterate), T = {(xi, yi), for i = 1, . . . , n} (training set), m (memory parameter), r (batch, fraction of n), o (overlap, fraction of batch), k ← 0 (iteration counter).
1: Create initial batch S0                                ▷ As shown in Figure 1
2: for k = 0, 1, 2, ... do
3:    Calculate the search direction pk = −Hk g_k^{Sk}    ▷ Using L-BFGS formula
4:    Choose the step length αk > 0
5:    Compute wk+1 = wk + αk pk
6:    Create the next batch Sk+1
7:    Compute the curvature pairs sk+1 = wk+1 − wk and yk+1 = g_{k+1}^{Ok} − g_k^{Ok}
8:    Replace the oldest pair (si, yi) by (sk+1, yk+1)
9: end for

2.2 Sample Generation

We now discuss how the sample Sk+1 is created at each iteration (Line 6 in Algorithm 1).
Distributed Computing with Faults. Consider a distributed implementation in which slave nodes read the current iterate wk from the master node, compute a local gradient on a subset of the dataset, and send it back to the master node for aggregation in the calculation (2.2). Given a time (computational) budget, it is possible for some nodes to fail to return a result. The schematic in Figure 1a illustrates the gradient calculation across two iterations, k and k+1, in the presence of faults. Here Bi, i = 1, ..., B denote the batches of data that each slave node i receives (where T = ∪i Bi), and ∇̃f(w) is the gradient calculation using all nodes that responded within the preallocated time.

Figure 1: Sample and Overlap formation.

Let Jk ⊂ {1, 2, ..., B} and Jk+1 ⊂ {1, 2, ..., B} be the sets of indices of all nodes that returned a gradient at the k-th and k+1-st iterations, respectively. Using this notation, Sk = ∪_{j∈Jk} Bj and Sk+1 = ∪_{j∈Jk+1} Bj, and we define Ok = ∪_{j∈Jk∩Jk+1} Bj.
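A small sketch of how S_k, S_{k+1} and O_k arise from the responding nodes, under the assumption that each node j holds a preallocated index set B_j (the data layout here is hypothetical):

```python
def batches_from_responding_nodes(node_batches, J_k, J_next):
    """S_k = U_{j in J_k} B_j, S_{k+1} = U_{j in J_{k+1}} B_j and
    O_k = U_{j in J_k ∩ J_{k+1}} B_j, as in Section 2.2.
    `node_batches[j]` is the index set B_j preallocated to node j."""
    S_k = set().union(*(node_batches[j] for j in J_k))
    S_next = set().union(*(node_batches[j] for j in J_next))
    # Only nodes that responded at BOTH iterations contribute to the overlap.
    O_k = set().union(*(node_batches[j] for j in set(J_k) & set(J_next)))
    return S_k, S_next, O_k
```

If no node responds at both iterations the overlap is empty, which is exactly the situation the method must guard against by requiring that failures not be too correlated.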
The simplest implementation in this setting preallocates the data on each compute node, requiring minimal data communication, i.e., only one data transfer. In this case the samples Sk will be independent if node failures occur randomly. On the other hand, if the same set of nodes fail, then sample creation will be biased, which is harmful both in theory and practice. One way to ensure independent sampling is to shuffle and redistribute the data to all nodes after a certain number of iterations.
Multi-batch Sampling. We propose two strategies for the multi-batch setting.
Figure 1b illustrates the sample creation process in the first strategy. The dataset is shuffled and batches are generated by collecting subsets of the training set, in order. Every set (except S0) is of the form Sk = {Ok−1, Nk, Ok}, where Ok−1 and Ok are the overlapping samples with batches Sk−1 and Sk+1 respectively, and Nk are the samples that are unique to batch Sk. After each pass through the dataset, the samples are reshuffled, and the procedure described above is repeated. In our implementation samples are drawn without replacement, guaranteeing that after every pass (epoch) all samples are used. This strategy has the advantage that it requires no extra computation in the evaluation of g_k^{Ok} and g_{k+1}^{Ok}, but the samples {Sk} are not independent.
The second sampling strategy is simpler and requires less control. At every iteration k, a batch Sk is created by randomly selecting |Sk| elements from {1, . . . , n}. The overlapping set Ok is then formed by randomly selecting |Ok| elements from Sk (subsampling).
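The two sampling strategies can be sketched as follows. Function names and the epoch-boundary handling are illustrative; the paper's first strategy also reshuffles after each pass, which is omitted here.

```python
import numpy as np

def overlapping_batches(n, batch_size, overlap_size, rng):
    """First strategy (Figure 1b): shuffle once, then slide through the
    data so that consecutive batches S_k share `overlap_size` indices;
    the overlap O_k then needs no extra gradient evaluations, but the
    batches are not independent."""
    perm = rng.permutation(n)
    step = batch_size - overlap_size
    batches, start = [], 0
    while start + batch_size <= n:
        batches.append(perm[start:start + batch_size])
        start += step
    return batches

def random_batch_with_overlap(n, batch_size, overlap_size, rng):
    """Second strategy: draw S_k uniformly at random, then subsample O_k
    from S_k; simpler, but g^{O_k} costs extra gradient evaluations."""
    S = rng.choice(n, size=batch_size, replace=False)
    O = rng.choice(S, size=overlap_size, replace=False)
    return S, O
```

In the first strategy the trailing `overlap_size` indices of S_k are, by construction, the leading indices of S_{k+1}.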
This strategy is slightly more expensive since g_{k+1}^{Ok} requires extra computation, but if the overlap is small this cost is not significant.

3 Convergence Analysis

In this section, we analyze the convergence properties of the multi-batch L-BFGS method (Algorithm 1) when applied to the minimization of strongly convex and nonconvex objective functions, using a fixed step length strategy. We assume that the goal is to minimize the empirical risk F given in (1.1), but note that a similar analysis could be used to study the minimization of the expected risk.

3.1 Strongly Convex case

Due to the stochastic nature of the multi-batch approach, every iteration of Algorithm 1 employs a gradient that contains errors that do not converge to zero. Therefore, by using a fixed step length strategy one cannot establish convergence to the optimal solution w⋆, but only convergence to a neighborhood of w⋆ [18]. Nevertheless, this result is of interest as it reflects the common practice of using a fixed step length and decreasing it only if the desired testing error has not been achieved. It also illustrates the tradeoffs that arise between the size of the batch and the step length.
In our analysis, we make the following assumptions about the objective function and the algorithm.
Assumptions A.
1. F is twice continuously differentiable.
2. There exist positive constants λ̂ and Λ̂ such that λ̂I ⪯ ∇²F^O(w) ⪯ Λ̂I for all w ∈ Rd and all sets O ⊂ {1, 2, . . . , n}.
3. There is a constant γ such that E_S[‖∇F^S(w)‖]² ≤ γ² for all w ∈ Rd and all sets S ⊂ {1, 2, . . . , n}.
4. The samples S are drawn independently and ∇F^S(w) is an unbiased estimator of the true gradient ∇F(w) for all w ∈ Rd, i.e., E_S[∇F^S(w)] = ∇F(w).
Note that Assumption A.2 implies that the entire Hessian ∇²F(w) also satisfies

λI ⪯ ∇²F(w) ⪯ ΛI,    ∀w ∈ Rd,

for some constants λ, Λ > 0. Assuming that every sub-sampled function F^O(w) is strongly convex is not unreasonable, as a regularization term is commonly added in practice when that is not the case.
We begin by showing that the inverse Hessian approximations Hk generated by the multi-batch L-BFGS method have eigenvalues that are uniformly bounded above and away from zero. The proof technique used is an adaptation of that in [8].
Lemma 3.1. If Assumptions A.1-A.2 above hold, there exist constants 0 < µ1 ≤ µ2 such that the Hessian approximations {Hk} generated by Algorithm 1 satisfy

µ1 I ⪯ Hk ⪯ µ2 I,    for k = 0, 1, 2, . . .

Utilizing Lemma 3.1, we show that the multi-batch L-BFGS method with a constant step length converges to a neighborhood of the optimal solution.
Theorem 3.2. Suppose that Assumptions A.1-A.4 hold and let F⋆ = F(w⋆), where w⋆ is the minimizer of F. Let {wk} be the iterates generated by Algorithm 1 with αk = α ∈ (0, 1/(2µ1λ)), starting from w0. Then for all k ≥ 0,

E[F(wk) − F⋆] ≤ (1 − 2αµ1λ)^k [F(w0) − F⋆] + [1 − (1 − 2αµ1λ)^k] (α µ2² γ² Λ)/(4 µ1 λ)  →  (α µ2² γ² Λ)/(4 µ1 λ)  as k → ∞.

The bound provided by this theorem has two components: (i) a term decaying linearly to zero, and (ii) a term identifying the neighborhood of convergence. Note that a larger step length yields a more favorable constant in the linearly decaying term, at the cost of an increase in the size of the neighborhood of convergence.
We will consider these tradeoffs again in Section 4, where we also note that larger batches increase the opportunities for parallelism and improve the limiting accuracy in the solution, but slow down the learning abilities of the algorithm.
One can establish convergence of the multi-batch L-BFGS method to the optimal solution w⋆ by employing a sequence of step lengths {αk} that converge to zero according to the schedule proposed by Robbins and Monro [23]. However, that provides only a sublinear rate of convergence, which is of little interest in our context, where large batches are employed and some type of linear convergence is expected. In this light, Theorem 3.2 is more relevant to practice.

3.2 Nonconvex case

The BFGS method is known to fail on nonconvex problems [17, 10]. Even for L-BFGS, which makes only a finite number of updates at each iteration, one cannot guarantee that the Hessian approximations have eigenvalues that are uniformly bounded above and away from zero. To establish convergence of the BFGS method in the nonconvex case, cautious updating procedures have been proposed [15]. Here we employ a cautious strategy that is well suited to our particular algorithm; we skip the update, i.e., set Hk+1 = Hk, if the curvature condition

y_k^T s_k ≥ ε ‖s_k‖²    (3.5)

is not satisfied, where ε > 0 is a predetermined constant. Using this mechanism we show that the eigenvalues of the Hessian matrix approximations generated by the multi-batch L-BFGS method are bounded above and away from zero (Lemma 3.3). The analysis presented in this section is based on the following assumptions.
Assumptions B.
1. F is twice continuously differentiable.
2. The gradients of F are Λ-Lipschitz continuous, and the gradients of F^O are Λ_O-Lipschitz continuous for all w ∈ Rd and all sets O ⊂ {1, 2, . . . , n}.
3. The function F(w) is bounded below by a scalar F̂.
4. There exist constants γ ≥ 0 and η > 0 such that E_S[‖∇F^S(w)‖]² ≤ γ² + η‖∇F(w)‖² for all w ∈ Rd and all sets S ⊂ {1, 2, . . . , n}.
5. The samples S are drawn independently and ∇F^S(w) is an unbiased estimator of the true gradient ∇F(w) for all w ∈ Rd, i.e., E[∇F^S(w)] = ∇F(w).
Lemma 3.3. Suppose that Assumptions B.1-B.2 hold and let ε > 0 be given. Let {Hk} be the Hessian approximations generated by Algorithm 1, with the modification that Hk+1 = Hk whenever (3.5) is not satisfied. Then, there exist constants 0 < µ1 ≤ µ2 such that

µ1 I ⪯ Hk ⪯ µ2 I,    for k = 0, 1, 2, . . .

We can now follow the analysis in [4, Chapter 4] to establish the following result about the behavior of the gradient norm for the multi-batch L-BFGS method with a cautious update strategy.
Theorem 3.4. Suppose that Assumptions B.1-B.5 above hold, and let ε > 0 be given. Let {wk} be the iterates generated by Algorithm 1, with αk = α ∈ (0, µ1/(2 µ2² η Λ)), starting from w0, and with the modification that Hk+1 = Hk whenever (3.5) is not satisfied. Then,

E[ (1/L) Σ_{k=0}^{L−1} ‖∇F(wk)‖² ] ≤ (2[F(w0) − F̂])/(α µ1 L) + (α µ2² γ² Λ)/µ1  →  (α µ2² γ² Λ)/µ1  as L → ∞.

This result bounds the average norm of the gradient of F after the first L − 1 iterations, and shows that the iterates spend increasingly more time in regions where the objective function has a small gradient.

4 Numerical Results

In this section, we present numerical results that evaluate the proposed robust multi-batch L-BFGS scheme (Algorithm 1) on logistic regression problems.
Figure 2 shows the performance on the webspam dataset1, where we compare it against three methods: (i) multi-batch L-BFGS without enforcing sample consistency (L-BFGS), where gradient differences are computed using different samples, i.e., yk = g_{k+1}^{Sk+1} − g_k^{Sk}; (ii) multi-batch gradient descent (Gradient Descent), which is obtained by setting Hk = I in Algorithm 1; and, (iii) serial SGD, where at every iteration one sample is used to compute the gradient. We run each method with 10 different random seeds, and, where applicable, report results for different batch (r) and overlap (o) sizes. The proposed method is more stable than the standard L-BFGS method; this is especially noticeable when r is small. On the other hand, serial SGD achieves similar accuracy as the robust L-BFGS method and at a similar rate (e.g., r = 1%), at the cost of n communications per epoch versus 1/(r(1−o)) communications per epoch. Figure 2 also indicates that the robust L-BFGS method is not too sensitive to the size of the overlap. Similar behavior was observed on other datasets, in regimes where r · o was not too small. We mention in passing that the L-BFGS step was computed using the vector-free implementation proposed in [9].

Figure 2: webspam dataset. Comparison of Robust L-BFGS, L-BFGS (multi-batch L-BFGS without enforcing sample consistency), Gradient Descent (multi-batch Gradient method) and SGD for various batch (r) and overlap (o) sizes. Solid lines show average performance, and dashed lines show worst and best performance, over 10 runs (per algorithm). K = 16 MPI processes.

We also explore the performance of the robust multi-batch L-BFGS method in the presence of node failures (faults), and compare it to the multi-batch variant that does not enforce sample consistency (L-BFGS).
Figure 3 illustrates the performance of the methods on the webspam dataset, for various probabilities of node failure p ∈ {0.1, 0.3, 0.5}, and suggests that the robust L-BFGS variant is more stable.

1LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.

Figure 3: webspam dataset. Comparison of Robust L-BFGS and L-BFGS (multi-batch L-BFGS without enforcing sample consistency), for various node failure probabilities p. Solid lines show average performance, and dashed lines show worst and best performance, over 10 runs (per algorithm). K = 16 MPI processes.

Lastly, we study the strong and weak scaling properties of the robust L-BFGS method on artificial data (Figure 4). We measure the time needed to compute a gradient (Gradient) and the associated communication (Gradient+C), as well as the time needed to compute the L-BFGS direction (L-BFGS) and the associated communication (L-BFGS+C), for various batch sizes (r).
The figure on the left shows strong scaling of multi-batch L-BFGS on a d = 10^4 dimensional problem with n = 10^7 samples. The size of the input data is 24GB, and we vary the number of MPI processes, K ∈ {1, 2, . . . , 128}. The time it takes to compute the gradient decreases with K; however, for small values of r, the communication time exceeds the compute time. The figure on the right shows weak scaling on a problem of similar size, but with varying sparsity. Each sample has 10 · K non-zero elements; thus, for any K the size of the local problem is roughly 1.5GB (for K = 128 the size of the data is 192GB). We observe almost constant time for the gradient computation, while the cost of computing the L-BFGS direction decreases with K; however, if communication is considered, the overall time needed to compute the L-BFGS direction increases slightly.

Figure 4: Strong and weak scaling of the multi-batch L-BFGS method.

5 Conclusion

This paper describes a novel variant of the L-BFGS method that is robust and efficient in two settings. The first occurs in the presence of node failures in a distributed computing implementation; the second arises when one wishes to employ a different batch at each iteration in order to accelerate learning. The proposed method avoids the pitfalls of using inconsistent gradient differences by performing quasi-Newton updating based on the overlap between consecutive samples. Numerical results show that the method is efficient in practice, and a convergence analysis illustrates its theoretical properties.

Acknowledgements

The first two authors were supported by the Office of Naval Research award N000141410313, the Department of Energy grant DE-FG02-87ER25047 and the National Science Foundation grant DMS-1620022.
Martin Takáč was supported by National Science Foundation grant CCF-1618717.

References
[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[3] R. Bollapragada, R. Byrd, and J. Nocedal. Exact and inexact subsampled Newton methods for optimization. arXiv preprint arXiv:1609.08502, 2016.
[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[5] L. Bottou and Y. LeCun. Large scale online learning. In NIPS, pages 217–224, 2004.
[6] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008.
[7] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal.
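The overlap-based updating summarized in the conclusion can be sketched concretely. The following is a minimal illustration, not the authors' implementation: curvature pairs (s_k, y_k) are formed from gradient differences evaluated on the overlap O_k = S_k ∩ S_{k+1} of consecutive batches, so that both gradients in the difference are computed on the same data points, and the search direction is the standard L-BFGS two-loop recursion. The least-squares objective, batch size, memory m, and fixed step length α are all hypothetical choices made for brevity.

```python
import numpy as np

# Synthetic least-squares problem (hypothetical stand-in for the empirical risk F).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y_data = X @ w_true

def grad(w, idx):
    """Gradient of F_S(w) = (1/2|S|) ||X_S w - y_S||^2 over the sample subset S."""
    Xs, ys = X[idx], y_data[idx]
    return Xs.T @ (Xs @ w - ys) / len(idx)

def two_loop(g, S, Y):
    """Standard L-BFGS two-loop recursion: returns the direction -H_k g."""
    q = g.copy()
    alphas = []
    for s, yv in zip(reversed(S), reversed(Y)):       # newest pair first
        a = s.dot(q) / yv.dot(s)
        alphas.append(a)
        q = q - a * yv
    if S:  # scale by gamma_k = s^T y / y^T y (initial Hessian approximation)
        q = q * (S[-1].dot(Y[-1]) / Y[-1].dot(Y[-1]))
    for (s, yv), a in zip(zip(S, Y), reversed(alphas)):  # oldest pair first
        b = yv.dot(q) / yv.dot(s)
        q = q + (a - b) * s
    return -q

w = np.zeros(d)
S, Y, m = [], [], 5          # curvature pairs and memory size
alpha = 0.5                  # fixed step length (hypothetical choice)
batch = rng.choice(n, size=100, replace=False)
for k in range(100):
    w_new = w + alpha * two_loop(grad(w, batch), S, Y)
    next_batch = rng.choice(n, size=100, replace=False)
    overlap = np.intersect1d(batch, next_batch)       # O_k = S_k ∩ S_{k+1}
    if len(overlap) > 0:
        s_k = w_new - w
        # Both gradients in y_k use the SAME data points, so the pair is consistent.
        y_k = grad(w_new, overlap) - grad(w, overlap)
        if s_k.dot(y_k) > 1e-10:                      # curvature (positivity) check
            S.append(s_k); Y.append(y_k)
            if len(S) > m:
                S.pop(0); Y.pop(0)
    w, batch = w_new, next_batch
```

Here the batches are drawn independently, so the overlap arises by chance; in the method itself, consecutive batches are constructed so that an overlap is guaranteed. The point of the sketch is the pairing: because y_k differences two gradients over the same overlap samples, the quasi-Newton update remains stable even though the full batch changes at every iteration.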