{"title": "Probabilistic Line Searches for Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 181, "page_last": 189, "abstract": "In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.", "full_text": "Probabilistic Line Searches\nfor Stochastic Optimization\n\nMaren Mahsereci and Philipp Hennig\n\nMax Planck Institute for Intelligent Systems\nSpemannstra\u00dfe 38, 72076 T\u00a8ubingen, Germany\n[mmahsereci|phennig]@tue.mpg.de\n\nAbstract\n\nIn deterministic optimization, line searches are a standard tool ensuring stability\nand ef\ufb01ciency. Where only stochastic gradients are available, no direct equivalent\nhas so far been formulated, because uncertain gradients do not allow for a strict\nsequence of decisions collapsing the search space. We construct a probabilistic line\nsearch by combining the structure of existing deterministic methods with notions\nfrom Bayesian optimization. Our method retains a Gaussian process surrogate of\nthe univariate optimization objective, and uses a probabilistic belief over the Wolfe\nconditions to monitor the descent. The algorithm has very low computational cost,\nand no user-controlled parameters. 
Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.

1 Introduction

Stochastic gradient descent (SGD) [1] is currently the standard in machine learning for the optimization of highly multivariate functions if their gradient is corrupted by noise. This includes the online or batch training of neural networks, logistic regression [2, 3] and variational models [e.g. 4, 5, 6]. In all these cases, noisy gradients arise because an exchangeable loss function L(x) of the optimization parameters x ∈ R^D, across a large dataset {d_i}_{i=1,...,M}, is evaluated only on a subset {d_j}_{j=1,...,m}:

    L(x) := (1/M) ∑_{i=1}^{M} ℓ(x, d_i) ≈ (1/m) ∑_{j=1}^{m} ℓ(x, d_j) =: L̂(x),    m ≪ M.    (1)

If the indices j are i.i.d. draws from [1, M], then by the Central Limit Theorem the error L̂(x) − L(x) is unbiased and approximately normally distributed. Despite its popularity and its low cost per step, SGD has well-known deficiencies that can make it inefficient, or at least tedious to use in practice. Two main issues are that, first, the gradient itself, even without noise, is not the optimal search direction; and second, SGD requires a step size (learning rate) that has a drastic effect on the algorithm's efficiency, is often difficult to choose well, and is virtually never optimal for each individual descent step. The former issue, adapting the search direction, has been addressed by many authors [see 7, for an overview]. Existing approaches range from lightweight 'diagonal preconditioning' approaches like ADAGRAD [8] and 'stochastic meta-descent' [9], to empirical estimates for the natural gradient [10] or the Newton direction [11], to problem-specific algorithms [12], and more elaborate estimates of the Newton direction [13].
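Returning to Eq. (1): the minibatch estimator it defines can be sketched as follows (a minimal illustration; `minibatch_loss_and_grad` and all other names here are ours, not from the paper's implementation):

```python
import numpy as np

def minibatch_loss_and_grad(loss_grad, x, data, m, rng):
    """Estimate L(x) and its gradient from a random batch of size m (Eq. 1).

    loss_grad(x, d) must return (loss value, gradient) for one datum d.
    All names here are illustrative, not from the paper's code.
    """
    batch = [data[j] for j in rng.integers(0, len(data), size=m)]  # i.i.d. draws
    losses, grads = zip(*(loss_grad(x, d) for d in batch))
    return np.mean(losses), np.mean(grads, axis=0)  # unbiased estimates of L, grad L
```

Because the batch indices are i.i.d., the returned values are unbiased, and by the Central Limit argument approximately Gaussian around L(x) and ∇L(x).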
Most of these algorithms also include an auxiliary adaptive effect on the learning rate. And Schaul et al. [14] recently provided an estimation method to explicitly adapt the learning rate from one gradient descent step to another. None of these algorithms change the size of the current descent step. Accumulating statistics across steps in this fashion requires some conservatism: if the step size is initially too large, or grows too fast, SGD can become unstable and 'explode', because individual steps are not checked for robustness at the time they are taken.

[Figure 1: plot of function value f(t) against distance t in the line search direction, with evaluation points (1) through (6).]

Figure 1: Sketch: The task of a classic line search is to tune the step taken by an optimization algorithm along a univariate search direction. The search starts at the endpoint (1) of the previous line search, at t = 0. A sequence of exponentially growing extrapolation steps (2), (3), (4) finds a point of positive gradient at (4). It is followed by interpolation steps (5), (6) until an acceptable point (6) is found. Points of insufficient decrease, above the line f(0) + c_1 t f'(0) (gray area), are excluded by the Armijo condition W-I, while points of steep gradient (orange areas) are excluded by the curvature condition W-II (weak Wolfe conditions in solid orange, strong extension in lighter tone). Point (6) is the first to fulfil both conditions, and is thus accepted.

Essentially the same problem exists in deterministic (noise-free) optimization. There, providing stability is one of several tasks of the line search subroutine.
It is a standard constituent of algorithms like the classic nonlinear conjugate gradient [15] and BFGS [16, 17, 18, 19] methods [20, §3].1 In the noise-free case, line searches are considered a solved problem [20, §3]. But the methods used in deterministic optimization are not stable to noise. They are easily fooled by even small disturbances, either becoming overly conservative or failing altogether. The reason for this brittleness is that existing line searches take a sequence of hard decisions to shrink or shift the search space. This yields efficiency, but breaks hard in the presence of noise. Section 3 constructs a probabilistic line search for noisy objectives, stabilizing optimization methods like the works cited above. As line searches only change the length, not the direction, of a step, they could be used in combination with the algorithms adapting SGD's direction, cited above. The algorithm presented below is thus a complement, not a competitor, to these methods.

2 Connections

2.1 Deterministic Line Searches

There is a host of existing line search variants [20, §3]. In essence, though, these methods explore a univariate domain 'to the right' of a starting point, until an 'acceptable' point is reached (Figure 1).

More precisely, consider the problem of minimizing L(x) : R^D → R, with access to ∇L(x) : R^D → R^D. At iteration i, some 'outer loop' chooses, at location x_i, a search direction s_i ∈ R^D (e.g. by the BFGS rule, or simply s_i = −∇L(x_i) for gradient descent). It will not be assumed that s_i has unit norm. The line search operates along the univariate domain x(t) = x_i + t s_i for t ∈ R+. Along this direction it collects scalar function values and projected gradients, denoted f(t) = L(x(t)) and f'(t) = s_i^T ∇L(x(t)) ∈ R. Most line searches involve an initial extrapolation phase to find a point t_r with f'(t_r) > 0. This is followed by a search in [0, t_r], by interval nesting or by interpolation of the collected function and gradient values, e.g. with cubic splines.2

2.1.1 The Wolfe Conditions for Termination

As the line search is only an auxiliary step within a larger iteration, it need not find an exact root of f'; it suffices to find a point 'sufficiently' close to a minimum. The Wolfe [21] conditions are a widely accepted formalization of this notion; they consider t acceptable if it fulfills

    f(t) ≤ f(0) + c_1 t f'(0)   (W-I)    and    f'(t) ≥ c_2 f'(0)   (W-II),    (2)

using two constants 0 ≤ c_1 < c_2 ≤ 1 chosen by the designer of the line search, not the user. W-I is the Armijo [22], or sufficient decrease, condition. It encodes that acceptable function values should lie below a linear extrapolation line of slope c_1 f'(0). W-II is the curvature condition, demanding

1 In these algorithms, another task of the line search is to guarantee certain properties of the surrounding estimation rule. In BFGS, e.g., it ensures positive definiteness of the estimate. This aspect will not feature here.
2 This is the strategy in minimize.m by C. Rasmussen, which provided a model for our implementation. At the time of writing, it can be found at http://learning.eng.cam.ac.uk/carl/code/minimize/minimize.m

[Figure 2: plots of f(t), p_a(t), p_b(t), ρ(t), and pWolfe(t) (weak and strong variants) against distance t in the line search direction.]

Figure 2: Sketch of a probabilistic line search. As in Fig.
1, the algorithm performs extrapolation (\u008d,\u008e,\u008f)\nand interpolation (\u0090,\u009b), but receives unreliable, noisy\nfunction and gradient values. These are used to con-\nstruct a GP posterior (top. solid posterior mean, thin\nlines at 2 standard deviations, local pdf marginal as\nshading, three dashed sample paths). This implies a\nbivariate Gaussian belief (\u00a73.3) over the validity of the\nweak Wolfe conditions (middle three plots. pa(t) is the\nmarginal for W-I, pb(t) for W-II, \u03c1(t) their correlation).\nPoints are considered acceptable if their joint probabil-\nity pWolfe(t) (bottom) is above a threshold (gray). An\napproximation (\u00a73.3.1) to the strong Wolfe conditions\nis shown dashed.\n\na decrease in slope. The choice c1 = 0 accepts any value below f (0), while c1 = 1 rejects all\npoints for convex functions. For the curvature condition, c2 = 0 only accepts points with f(cid:48)(t) \u2265 0;\nwhile c2 = 1 accepts any point of greater slope than f(cid:48)(0). W-I and W-II are known as the weak\nform of the Wolfe conditions. The strong form replaces W-II with |f(cid:48)(t)| \u2264 c2|f(cid:48)(0)| (W-IIa). This\nguards against accepting points of low function value but large positive gradient. Figure 1 shows a\nconceptual sketch illustrating the typical process of a line search, and the weak and strong Wolfe\nconditions. The exposition in \u00a73.3 will initially focus on the weak conditions, which can be precisely\nmodeled probabilistically. Section 3.3.1 then adds an approximate treatment of the strong form.\n\n2.2 Bayesian Optimization\n\nA recently blossoming sample-ef\ufb01cient approach to global optimization revolves around modeling\nthe objective f with a probability measure p(f ); usually a Gaussian process (GP). Searching for\nextrema, evaluation points are then chosen by a utility functional u[p(f )]. Our line search borrows\nthe idea of a Gaussian process surrogate, and a popular utility, expected improvement [23]. 
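As a concrete reference point, the deterministic Wolfe acceptance test of §2.1.1 (Eq. 2, with the strong variant W-IIa) can be written in a few lines; this sketch uses the paper's default thresholds, but the function name and interface are ours:

```python
def wolfe_accept(f0, df0, ft, dft, t, c1=0.05, c2=0.8, strong=False):
    """Check the Wolfe conditions of Eq. (2) for a candidate step t.

    f0, df0: function value and projected gradient at t = 0;
    ft, dft: the same quantities at t. `strong` switches W-II to W-IIa.
    """
    w1 = ft <= f0 + c1 * t * df0  # sufficient decrease (Armijo, W-I)
    # curvature condition: weak (W-II) or strong (W-IIa)
    w2 = abs(dft) <= c2 * abs(df0) if strong else dft >= c2 * df0
    return w1 and w2
```

The probabilistic line search replaces this hard binary decision with a belief over the same two inequalities (§3.3).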
Bayesian optimization methods are often computationally expensive, thus ill-suited for a cost-sensitive task like a line search. But since line searches are governors more than information extractors, the kind of sample-efficiency expected of a Bayesian optimizer is not needed. The following sections develop a lightweight algorithm which adds only minor computational overhead to stochastic optimization.

3 A Probabilistic Line Search

We now consider minimizing y(t) = L̂(x(t)) from Eq. (1). That is, the algorithm can access only noisy function values and gradients y_t, y'_t at location t, with Gaussian likelihood

    p(y_t, y'_t | f) = N( [y_t; y'_t]; [f(t); f'(t)], [[σ_f², 0], [0, σ_f'²]] ).    (3)

The Gaussian form is supported by the Central Limit argument at Eq. (1); see §3.4 regarding estimation of the variances σ_f², σ_f'². Our algorithm has three main ingredients: a robust yet lightweight Gaussian process surrogate on f(t) facilitating analytic optimization; a simple Bayesian optimization objective for exploration; and a probabilistic formulation of the Wolfe conditions as a termination criterion.

3.1 Lightweight Gaussian Process Surrogate

We model information about the objective in a probability measure p(f). There are two requirements on such a measure: First, it must be robust to irregularity of the objective. And second, it must allow analytic computation of discrete candidate points for evaluation, because a line search should not call yet another optimization subroutine itself. Both requirements are fulfilled by a once-integrated Wiener process, i.e. a zero-mean Gaussian process prior p(f) = GP(f; 0, k) with covariance function

    k(t, t') = θ² [ min³(t̃, t̃')/3 + |t − t'| min²(t̃, t̃')/2 ].    (4)

Here t̃ := t + τ and t̃' := t' + τ denote a shift by a constant τ > 0. This ensures the kernel is positive semi-definite; the precise value of τ is irrelevant, as the algorithm only considers positive values of t (our implementation uses τ = 10). See §3.4 regarding the scale θ². With the likelihood of Eq. (3), this prior gives rise to a GP posterior whose mean function is a cubic spline3 [25]. We note in passing that regression on f and f' from N observations of pairs (y_t, y'_t) can be formulated as a filter [26] and thus performed in O(N) time. However, since a line search typically collects < 10 data points, generic GP inference, using a Gram matrix, has virtually the same, low cost.

Because Gaussian measures are closed under linear maps [27, §10], Eq.
(4) implies a Wiener process (linear spline) model on f':

    p(f, f') = GP( [f; f']; 0, [[k, k^∂], [^∂k, ^∂k^∂]] ),    (5)

with (using the indicator function I(x) = 1 if x, else 0) the derivative covariances

    ^{∂^i}k^{∂^j}(t, t') := ∂^{i+j} k(t, t') / (∂t^i ∂t'^j),    thus
    k^∂(t, t') = θ² [ I(t < t') t̃²/2 + I(t ≥ t') (t̃ t̃' − t̃'²/2) ],
    ^∂k(t, t') = θ² [ I(t' < t) t̃'²/2 + I(t' ≥ t) (t̃ t̃' − t̃²/2) ],
    ^∂k^∂(t, t') = θ² min(t̃, t̃').    (6)

Given a set of evaluations (t, y, y') (vectors, with elements t_i, y_{t_i}, y'_{t_i}) with independent likelihood (3), the posterior p(f | y, y') is a GP with posterior mean μ and covariance k̃ as follows:

    μ(t) = [k_tt, k^∂_tt] ( [[k_tt + σ_f² I, k^∂_tt], [^∂k_tt, ^∂k^∂_tt + σ_f'² I]] )^{-1} [y; y'] =: g^T(t) [y; y'],
    k̃(t, t') = k_tt' − g^T(t) [k_tt'; k^∂_tt'].    (7)

The posterior marginal variance will be denoted by V(t) = k̃(t, t). To see that μ is indeed piecewise cubic (i.e.
a cubic spline), we note that it has at most three non-vanishing derivatives4, because\n\nk\u22022\nk\u22023\n\n(t, t(cid:48)) = \u03b82I(t \u2264 t(cid:48))(t(cid:48) \u2212 t)\n(t, t(cid:48)) = \u2212\u03b82I(t \u2264 t(cid:48))\n\nk\u22022 \u2202(t, t(cid:48)) = \u03b82I(t \u2264 t(cid:48))\nk\u22023 \u2202(t, t(cid:48)) = 0.\n\n(8)\n\nThis piecewise cubic form of \u00b5 is crucial for our purposes: having collected N values of f and\nf(cid:48), respectively, all local minima of \u00b5 can be found analytically in O(N ) time in a single sweep\nthrough the \u2018cells\u2019 ti\u22121 < t < ti, i = 1, . . . , N (here t0 = 0 denotes the start location, where (y0, y(cid:48)\n0)\nare \u2018inherited\u2019 from the preceding line search. For typical line searches N < 10, c.f. \u00a74). In each\ncell, \u00b5(t) is a cubic polynomial with at most one minimum in the cell, found by a trivial quadratic\ncomputation from the three scalars \u00b5(cid:48)(ti), \u00b5(cid:48)(cid:48)(ti), \u00b5(cid:48)(cid:48)(cid:48)(ti). This is in contrast to other GP regression\nmodels\u2014for example the one arising from a Gaussian kernel\u2014which give more involved posterior\nmeans whose local minima can be found only approximately. 
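A minimal sketch of this surrogate, using plain Gram-matrix inference as the paper notes suffices for < 10 points (Eqs. 4, 6, 7; the function names and interface are ours, not the paper's implementation):

```python
import numpy as np

def iw_kernels(t, s, theta2=1.0, tau=10.0):
    """Covariances of the once-integrated Wiener process (Eqs. 4, 6)."""
    tt, ss = t + tau, s + tau
    mn = np.minimum(tt, ss)
    k = theta2 * (mn**3 / 3 + np.abs(tt - ss) * mn**2 / 2)            # cov(f(t), f(s))
    kd = theta2 * np.where(tt < ss, tt**2 / 2, tt * ss - ss**2 / 2)   # cov(f(t), f'(s))
    dk = theta2 * np.where(ss < tt, ss**2 / 2, tt * ss - tt**2 / 2)   # cov(f'(t), f(s))
    dkd = theta2 * mn                                                  # cov(f'(t), f'(s))
    return k, kd, dk, dkd

def posterior_mean(tq, t, y, dy, sf2, sdf2, theta2=1.0, tau=10.0):
    """Posterior mean of f at query points tq given noisy (y, dy) at t (Eq. 7)."""
    T, S = np.meshgrid(t, t, indexing="ij")
    k, kd, dk, dkd = iw_kernels(T, S, theta2, tau)
    n = len(t)
    G = np.block([[k + sf2 * np.eye(n), kd],
                  [dk, dkd + sdf2 * np.eye(n)]])       # joint Gram matrix
    w = np.linalg.solve(G, np.concatenate([y, dy]))    # = g(t)-weights applied to data
    Tq, Sq = np.meshgrid(tq, t, indexing="ij")
    kq, kqd, _, _ = iw_kernels(Tq, Sq, theta2, tau)
    return np.hstack([kq, kqd]) @ w
```

As the noise levels shrink, the posterior mean interpolates the observed values and slopes, recovering the cubic-spline behavior described above.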
Another advantage of the cubic spline\ninterpolant is that it does not assume the existence of higher derivatives (in contrast to the Gaussian\nkernel, for example), and thus reacts robustly to irregularities in the objective.\nIn our algorithm, after each evaluation of (yN , y(cid:48)\nN ), we use this property to compute a short list\nof candidates for the next evaluation, consisting of the \u2264 N local minimizers of \u00b5(t) and one\nadditional extrapolation node at tmax + \u03b1, where tmax is the currently largest evaluated t, and \u03b1 is\nan extrapolation step size starting at \u03b1 = 1 and doubled after each extrapolation step.\n\n3.2 Choosing Among Candidates\n\nThe previous section described the construction of < N + 1 discrete candidate points for the next\nevaluation. To decide at which of the candidate points to actually call f and f(cid:48), we make use of\na popular utility from Bayesian optimization. Expected improvement [23] is the expected amount,\n3Eq. (4) can be generalized to the \u2018natural spline\u2019, removing the need for the constant \u03c4 [24, \u00a76.3.1]. However,\nthis notion is ill-de\ufb01ned in the case of a single observation, which is crucial for the line search.\n4There is no well-de\ufb01ned probabilistic belief over f(cid:48)(cid:48) and higher derivatives\u2014sample paths of the Wiener\nprocess are almost surely non-differentiable almost everywhere [28, \u00a72.2]. 
But μ(t) is always a member of the reproducing kernel Hilbert space induced by k, thus piecewise cubic [24, §6.1].

[Figure 3: five panels showing the GP posterior over f(t) (top) and the approximate pWolfe(t) (bottom) at noise levels σ_f between 0.0028 and 0.28 and σ_f' between 0.0049 and 0.014; panel titles: constraining, extrapolation, interpolation, immediate accept, high noise interpolation.]

Figure 3: Curated snapshots of line searches (from MNIST experiment, §4), showing variability of the objective's shape and the decision process. Top row: GP posterior and evaluations; bottom row: approximate pWolfe over strong Wolfe conditions.
Accepted point marked red.

under the GP surrogate, by which the function f(t) might be smaller than a 'current best' value η (we set η = min_{i=0,...,N} {μ(t_i)}, where t_i are the observed locations),

    u_EI(t) = E_{p(f_t | y, y')} [ min{0, η − f(t)} ]
            = (η − μ(t))/2 · [ 1 + erf( (η − μ(t)) / √(2 V(t)) ) ] + √( V(t) / (2π) ) · exp( −(η − μ(t))² / (2 V(t)) ).    (9)

The next evaluation point is chosen as the candidate maximizing this utility, multiplied by the probability for the Wolfe conditions to be fulfilled, which is derived in the following section.

3.3 Probabilistic Wolfe Conditions for Termination

The key observation for a probabilistic extension of W-I and W-II is that they are positivity constraints on two variables a_t, b_t that are both linear projections of the (jointly Gaussian) variables f and f':

    [a_t; b_t] = [[1, c_1 t, −1, 0], [0, −c_2, 0, 1]] [f(0); f'(0); f(t); f'(t)] ≥ 0.    (10)

The GP of Eq. (5) on f thus implies, at each value of t, a bivariate Gaussian distribution

    p(a_t, b_t) = N( [a_t; b_t]; [m_a^t; m_b^t], [[C_aa^t, C_ab^t], [C_ba^t, C_bb^t]] ),    (11)

with

    m_a^t = μ(0) − μ(t) + c_1 t μ'(0)    and    m_b^t = μ'(t) − c_2 μ'(0),    (12)

and

    C_aa^t = k̃_00 + (c_1 t)² k̃^∂∂_00 + k̃_tt + 2 [ c_1 t (k̃^∂_00 − k̃^∂_0t) − k̃_0t ],
    C_bb^t = c_2² k̃^∂∂_00 − 2 c_2 k̃^∂∂_0t + k̃^∂∂_tt,    (13)
    C_ab^t = C_ba^t = −c_2 (k̃^∂_00 + c_1 t k̃^∂∂_00) + (1 + c_2) k̃^∂_0t + c_1 t k̃^∂∂_0t − k̃^∂_tt.

The quadrant probability p_t^Wolfe = p(a_t > 0 ∧ b_t > 0) for the Wolfe conditions to hold is an integral over a bivariate normal probability,

    p_t^Wolfe = ∫_{−m_a^t / √(C_aa^t)}^{∞} ∫_{−m_b^t / √(C_bb^t)}^{∞} N( [a; b]; [0; 0], [[1, ρ_t], [ρ_t, 1]] ) da db,    (14)

with correlation coefficient ρ_t = C_ab^t / √(C_aa^t C_bb^t). It can be computed efficiently [29], using readily available code5 (on a laptop, one evaluation of p_t^Wolfe costs about 100 microseconds; each line search requires < 50 such calls). The line search computes this probability for all evaluation nodes, after each evaluation. If any of the nodes fulfills the Wolfe conditions with p_t^Wolfe > c_W, for some threshold 0 < c_W ≤ 1, it is accepted and returned. If several nodes simultaneously fulfill this requirement, the t of the lowest μ(t) is returned. Section 3.4 below motivates fixing c_W = 0.3.

5 e.g.
http://www.math.wsu.edu/faculty/genz/software/matlab/bvn.m

3.3.1 Approximation for strong conditions:

As noted in Section 2.1.1, deterministic optimizers tend to use the strong Wolfe conditions, which use |f'(0)| and |f'(t)|. A precise extension of these conditions to the probabilistic setting is numerically taxing, because the distribution over |f'| is a non-central χ-distribution, requiring customized computations. However, a straightforward variation to (14) captures the spirit of the strong Wolfe conditions, that large positive derivatives should not be accepted: Assuming f'(0) < 0 (i.e. that the search direction is a descent direction), the strong second Wolfe condition can be written exactly as

    0 ≤ b_t = f'(t) − c_2 f'(0) ≤ −2 c_2 f'(0).    (15)

The value −2 c_2 f'(0) is bounded to 95% confidence by

    −2 c_2 f'(0) ⪅ 2 c_2 ( |μ'(0)| + 2 √(V'(0)) ) =: b̄.    (16)

Hence, an approximation to the strong Wolfe conditions can be reached by replacing the infinite upper integration limit on b in Eq. (14) with (b̄ − m_b^t) / √(C_bb^t). The effect of this adaptation, which adds no overhead to the computation, is shown in Figure 2 as a dashed line.

3.4 Eliminating Hyper-parameters

As a black-box inner loop, the line search should not require any tuning by the user. The preceding section introduced six so-far undefined parameters: c_1, c_2, c_W, θ, σ_f, σ_f'. We will now show that c_1, c_2, c_W can be fixed by hard design decisions; θ can be eliminated by standardizing the optimization objective within the line search; and the noise levels can be estimated at runtime with low overhead for batch objectives of the form in Eq. (1).
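The acceptance probability of Eq. (14) and the expected-improvement utility of Eq. (9) reduce to standard Gaussian integrals; a minimal sketch using SciPy's bivariate normal CDF (function names and interface are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import erf

def p_wolfe(ma, mb, Caa, Cbb, Cab):
    """Quadrant probability p(a_t > 0 and b_t > 0) of Eq. (14)."""
    rho = Cab / np.sqrt(Caa * Cbb)
    # By the symmetry z -> -z of the centered bivariate Gaussian, the upper
    # quadrant probability equals the CDF at the standardized means.
    return multivariate_normal.cdf(
        [ma / np.sqrt(Caa), mb / np.sqrt(Cbb)],
        mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def expected_improvement(mu, V, eta):
    """Expected improvement utility of Eq. (9) for posterior mean mu, variance V."""
    d = eta - mu
    return (d / 2) * (1 + erf(d / np.sqrt(2 * V))) \
        + np.sqrt(V / (2 * np.pi)) * np.exp(-d**2 / (2 * V))
```

The line search would evaluate `expected_improvement` times `p_wolfe` at each candidate and pick the maximizer, as described at the end of §3.2.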
The result is a parameter-free algorithm that\neffectively removes the one most problematic parameter from SGD\u2014the learning rate.\n\nDesign Parameters c1, c2, cW Our algorithm inherits the Wolfe thresholds c1 and c2 from its\ndeterministic ancestors. We set c1 = 0.05 and c2 = 0.8. This is a standard setting that yields a\n\u2018lenient\u2019 line search, i.e. one that accepts most descent points. The rationale is that the stochastic\naspect of SGD is not always problematic, but can also be helpful through a kind of \u2018annealing\u2019 effect.\nThe acceptance threshold cW is a new design parameter arising only in the probabilistic setting. We\n\ufb01x it to cW = 0.3. To motivate this value, \ufb01rst note that in the noise-free limit, all values 0 < cW < 1\nare equivalent, because pWolfe then switches discretely between 0 and 1 upon observation of the\nfunction. A back-of-the-envelope computation (left out for space), assuming only two evaluations\nat t = 0 and t = t1 and the same \ufb01xed noise level on f and f(cid:48) (which then cancels out), shows\nthat function values barely ful\ufb01lling the conditions, i.e. at1 = bt1 = 0, can have pWolfe \u223c 0.2 while\nfunction values at at1 = bt1 = \u2212\u0001 for \u0001(cid:95) 0 with \u2018unlucky\u2019 evaluations (both function and gradient\nvalues one standard-deviation from true value) can achieve pWolfe \u223c 0.4. The choice cW = 0.3\nbalances the two competing desiderata for precision and recall. Empirically (Fig. 3), we rarely\nobserved values of pWolfe close to this threshold. Even at high evaluation noise, a function evaluation\ntypically either clearly rules out the Wolfe conditions, or lifts pWolfe well above the threshold.\n\nthe optimization objective: We set \u03b8 = 1 and scale yi(cid:94) (yi\u2212y0)/|y(cid:48)\n\nScale \u03b8 The parameter \u03b8 of Eq. (4) simply scales the prior variance. It can be eliminated by scaling\n0| within the code of\nthe line search. 
This gives y(0) = 0 and y'(0) = −1, and typically ensures that the objective ranges in the single digits across 0 < t < 10, where most line searches take place. The gradients are scaled accordingly, y'_i ← y'_i / |y'_0|. The division by |y'_0| causes a non-Gaussian disturbance, but this does not seem to have a notable empirical effect.

Noise Scales σ_f, σ_f' The likelihood (3) requires standard deviations for the noise on both function values (σ_f) and gradients (σ_f'). One could attempt to learn these across several line searches. However, in exchangeable models, as captured by Eq. (1), the variance of the loss and its gradient can be estimated directly within the batch, at low computational overhead, an approach already advocated by Schaul et al. [14]. We collect the empirical statistics

    Ŝ(x) := (1/m) ∑_{j=1}^{m} ℓ²(x, y_j)    and    ∇̂S(x) := (1/m) ∑_{j=1}^{m} ∇ℓ(x, y_j)^{.2}    (17)

(where ^{.2} denotes the element-wise square) and estimate, at the beginning of a line search from x_k,

    σ_f² = (1/(m−1)) ( Ŝ(x_k) − L̂(x_k)² )    and    σ_f'² = s_i^{.2 T} [ (1/(m−1)) ( ∇̂S(x_k) − (∇L̂)^{.2} ) ].    (18)

This amounts to the cautious assumption that noise on the gradient is independent. We finally scale the two empirical estimates as described in §3.4: σ_f ← σ_f / |y'(0)|, and ditto for σ_f'. The overhead of this estimation is small if the computation of ℓ(x, y_j) itself is more expensive than the summation over j (in the neural network examples of §4, with their comparably simple ℓ, the additional steps added only ∼1% cost overhead to the evaluation of the loss). Of course, this approach requires a batch size m > 1.
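The within-batch estimates of Eqs. (17)-(18) can be sketched as follows (assuming per-example losses and gradients are available; the names are illustrative):

```python
import numpy as np

def noise_scales(losses, grads, s):
    """Variance of the batch loss and of the projected gradient (Eqs. 17-18).

    losses: (m,) per-example losses; grads: (m, D) per-example gradients;
    s: (D,) search direction. Names are illustrative, not from the paper's code.
    """
    m = len(losses)
    S = np.mean(losses**2)                 # \hat{S}(x)
    dS = np.mean(grads**2, axis=0)         # \hat{grad S}(x), element-wise square
    var_f = (S - np.mean(losses)**2) / (m - 1)                  # variance of the batch-mean loss
    var_df = s**2 @ ((dS - np.mean(grads, axis=0)**2) / (m - 1))  # projected gradient variance
    return var_f, var_df
```

Note that dividing by m − 1 (rather than m) makes these the estimated variances of the batch means themselves, which is exactly what the likelihood (3) needs.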
For single-sample batches, a running averaging could be used instead (single-sample batches\nare not necessarily a good choice. In our experiments, for example, vanilla SGD with batch size 10\nconverged faster in wall-clock time than unit-batch SGD). Estimating noise separately for each input\ndimension captures the often inhomogeneous structure among gradient elements, and its effect on the\nnoise along the projected direction. For example, in deep models, gradient noise is typically higher\non weights between the input and \ufb01rst hidden layer, hence line searches along the corresponding\ndirections are noisier than those along directions affecting higher-level weights.\n\n3.4.1 Propagating Step Sizes Between Line Searches\nAs will be demonstrated in \u00a74, the line search can \ufb01nd good step sizes even if the length of the\ndirection si (which is proportional to the learning rate \u03b1 in SGD) is mis-scaled. Since such scale\nissues typically persist over time, it would be wasteful to have the algorithm re-\ufb01t a good scale in each\nline search. Instead, we propagate step lengths from one iteration of the search to another: We set the\ninitial search direction to s0 = \u2212\u03b10\u2207 \u02c6L(x0) with some initial learning rate \u03b10. 
Then, after each line search ending at x_i = x_{i−1} + t* s_i, the next search direction is set to s_{i+1} = −1.3 · t* α_0 ∇L̂(x_i). Thus, the next line search starts its extrapolation at 1.3 times the step size of its predecessor.

Remark on convergence of SGD with line searches: We note in passing that it is straightforward to ensure that SGD instances using the line search inherit the convergence guarantees of SGD: Putting even an extremely loose bound ᾱ_i on the step sizes taken by the i-th line search, such that ∑_i^∞ ᾱ_i = ∞ and ∑_i^∞ ᾱ_i² < ∞, ensures the line search-controlled SGD converges in probability [1].

4 Experiments

Our experiments were performed on the well-worn problems of training a 2-layer neural net with logistic nonlinearity on the MNIST and CIFAR-10 datasets.6 In both cases, the network had 800 hidden units, giving optimization problems with 636 010 and 2 466 410 parameters, respectively. While this may be 'low-dimensional' by contemporary standards, it exhibits the stereotypical challenges of stochastic optimization for machine learning. Since the line search deals with only univariate subproblems, the extrinsic dimensionality of the optimization task is not particularly relevant for an empirical evaluation. Leaving aside the cost of the function evaluations themselves, computation cost associated with the line search is independent of the extrinsic dimensionality.

The central nuisance of SGD is having to choose the learning rate α, and potentially also a schedule for its decrease. Theoretically, a decaying learning rate is necessary to guarantee convergence of SGD [1], but empirically, keeping the rate constant, or only decaying it cautiously, often works better (Fig. 4).
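One reading of the propagation rule of §3.4.1 is a single multiplicative update carried between searches; a sketch (our names, and the recursive use of the previous scale is our interpretation of "1.3 times the step size of its predecessor"):

```python
def next_direction(grad, t_star, prev_scale):
    """Propagate the accepted step size to the next line search (cf. Sec. 3.4.1).

    grad: projected or full gradient at the new point; t_star: accepted step of
    the finished line search; prev_scale: scale used for the previous direction.
    Returns the next search direction and its scale. Names are illustrative.
    """
    scale = 1.3 * t_star * prev_scale  # start extrapolation 1.3x larger
    return -scale * grad, scale
```

If the accepted t* is repeatedly below 1/1.3, the scale shrinks; if the search keeps extrapolating, it grows, so persistent mis-scaling of the initial learning rate is corrected within a few iterations.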
In a practical setting, a user would perform exploratory experiments (say, for 10³ steps) to determine a good learning rate and decay schedule, then run a longer experiment in the best found setting. In our networks, constant learning rates of α = 0.75 and α = 0.08 for MNIST and CIFAR-10, respectively, achieved the lowest test error after the first 10³ steps of SGD. We then trained networks with vanilla SGD with and without α-decay (using the schedule α(i) = α_0/i), and SGD using the probabilistic line search, with α_0 ranging across five orders of magnitude, on batches of size m = 10.

Fig. 4, top, shows test errors after 10 epochs as a function of the initial learning rate α_0 (error bars based on 20 random re-starts). Across the broad range of α_0 values, the line search quickly identified good step sizes α(t), stabilized the training, and progressed efficiently, reaching test errors similar

6 http://yann.lecun.com/exdb/mnist/ and http://www.cs.toronto.edu/~kriz/cifar.html. Like other authors, we only used the "batch 1" sub-set of CIFAR-10.

[Figure 4: test error vs. initial learning rate (top) and vs. training epoch (bottom) for SGD with fixed α, SGD with decaying α, and the line search, on MNIST and CIFAR-10 2-layer neural nets.]

Figure 4: Top row: test error after 10 epochs as function of initial learning rate (note logarithmic ordinate for MNIST). Bottom row: test error as function of training epoch (same color and symbol scheme as in top row).
No matter the initial learning rate, the line search-controlled SGD performs close to the (in practice unknown) optimal SGD instance, effectively removing the need for exploratory experiments and learning-rate tuning. All plots show means and two standard deviations over 20 repetitions.

to those reported in the literature for tuned versions of this kind of architecture on these datasets. While on both datasets the best SGD instance without rate decay just barely outperformed the line searches, the optimal α value was not the one that performed best after 10³ steps. So this kind of exploratory experiment (which comes with its own cost of human designer time) would have led to worse performance than simply starting a single instance of SGD with the line search and α0 = 1, letting the algorithm do the rest.

The average time overhead (i.e. excluding evaluation time for the objective) was about 48 ms per line search. This is independent of the problem dimensionality, and expected to drop significantly with optimized code. Analysing one of the MNIST instances more closely, we found that the average length of a line search was ∼1.4 function evaluations; 80–90% of line searches terminated after the first evaluation. This suggests good scale adaptation and thus efficient search (note that an ‘optimally tuned’ algorithm would always lead to accepts).

The supplements provide additional plots of raw objective values, chosen step sizes, encountered gradient norms, and gradient noise during the optimization, as well as test-vs-train error plots, for each of the two datasets. These provide a richer picture of the step-size control performed by the line search. In particular, they show that the line search chooses step sizes that follow a nontrivial dynamic over time.
This is in line with the empirical truism that SGD requires tuning of the step size during its progress, a nuisance taken care of by the line search. Using this structured information for more elaborate analytical purposes, in particular for convergence estimation, is an enticing prospect, but beyond the scope of this paper.

5 Conclusion

The line search paradigm widely accepted in deterministic optimization can be extended to noisy settings. Our design combines existing principles from the noise-free case with ideas from Bayesian optimization, adapted for efficiency. We arrived at a lightweight “black-box” algorithm that exposes no parameters to the user. Our method is complementary to, and can in principle be combined with, virtually all existing methods for stochastic optimization that adapt a step direction of fixed length. Empirical evaluations suggest the line search effectively frees users from worries about the choice of a learning rate: any reasonable initial choice will be quickly adapted and lead to close-to-optimal performance. Our MATLAB implementation will be made available at the time of publication of this article.

References

[1] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, Sep. 1951.

[2] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Twenty-first International Conference on Machine Learning (ICML 2004), 2004.

[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th Int. Conf. on Computational Statistics (COMPSTAT), pages 177–186. Springer, 2010.

[4] M.D. Hoffman, D.M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[5] J. Hensman, M. Rattray, and N.D. Lawrence.
Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems (NIPS 25), pages 2888–2896, 2012.

[6] T. Broderick, N. Boyd, A. Wibisono, A.C. Wilson, and M.I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems (NIPS 26), pages 1727–1735, 2013.

[7] A.P. George and W.B. Powell. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167–198, 2006.

[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[9] N.N. Schraudolph. Local gain adaptation in stochastic gradient descent. In Ninth International Conference on Artificial Neural Networks (ICANN 99), volume 2, pages 569–574, 1999.

[10] S.-I. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.

[11] N.L. Roux and A.W. Fitzgibbon. A fast natural Newton method. In 27th International Conference on Machine Learning (ICML), pages 623–630, 2010.

[12] R. Rajesh, W. Chong, D. Blei, and E. Xing. An adaptive learning rate for stochastic variational inference. In 30th International Conference on Machine Learning (ICML), pages 298–306, 2013.

[13] P. Hennig. Fast probabilistic optimization from noisy gradients. In 30th International Conference on Machine Learning (ICML), 2013.

[14] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In 30th International Conference on Machine Learning (ICML-13), pages 343–351, 2013.

[15] R. Fletcher and C.M. Reeves. Function minimization by conjugate gradients. The Computer Journal, 7(2):149–154, 1964.

[16] C.G. Broyden. A new double-rank minimization algorithm.
Notices of the AMS, 16:670, 1969.

[17] R. Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13(3):317, 1970.

[18] D. Goldfarb. A family of variable metric updates derived by variational means. Math. Comp., 24(109):23–26, 1970.

[19] D.F. Shanno. Conditioning of quasi-Newton methods for function minimization. Math. Comp., 24(111):647–656, 1970.

[20] J. Nocedal and S.J. Wright. Numerical Optimization. Springer Verlag, 1999.

[21] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, pages 226–235, 1969.

[22] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.

[23] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[24] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT, 2006.

[25] G. Wahba. Spline Models for Observational Data. Number 59 in CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990.

[26] S. Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.

[27] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 3rd edition, 1991.

[28] R.J. Adler. The Geometry of Random Fields. Wiley, 1981.

[29] Z. Drezner and G.O. Wesolowsky. On the computation of the bivariate normal integral. Journal of Statistical Computation and Simulation, 35(1-2):101–107, 1990.