{"title": "Necessary and Sufficient Geometries for Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 11495, "page_last": 11505, "abstract": "We study the impact of the constraint set and gradient geometry on the convergence of online and stochastic methods for convex optimization, providing a characterization of the geometries for which stochastic gradient and adaptive gradient methods are (minimax) optimal. In particular, we show that when the constraint set is quadratically convex, diagonally pre-conditioned stochastic gradient methods are minimax optimal. We further provide a converse that shows that when the constraints are not quadratically convex---for example, any $\\ell_p$-ball for $p < 2$---the methods are far from optimal. Based on this, we can provide concrete recommendations for when one should use adaptive, mirror or stochastic gradient methods.", "full_text": "Necessary and Suf\ufb01cient Geometries\n\nfor Gradient Methods\n\nDaniel Levy\n\nStanford University\n\ndanilevy@stanford.edu\n\nJohn C. Duchi\n\nStanford University\n\njduchi@stanford.edu\n\nAbstract\n\nWe study the impact of the constraint set and gradient geometry on the convergence\nof online and stochastic methods for convex optimization, providing a charac-\nterization of the geometries for which stochastic gradient and adaptive gradient\nmethods are (minimax) optimal. In particular, we show that when the constraint\nset is quadratically convex, diagonally pre-conditioned stochastic gradient methods\nare minimax optimal. We further provide a converse that shows that when the\nconstraints are not quadratically convex\u2014for example, any `p-ball for p < 2\u2014the\nmethods are far from optimal. 
Based on this, we can provide concrete recommendations for when one should use adaptive, mirror or stochastic gradient methods.

1 Introduction

We study stochastic and online convex optimization in the following setting: for a collection $\{F(\cdot, x), x \in \mathcal{X}\}$ of convex functions $F(\cdot, x) : \mathbb{R}^d \to \mathbb{R}$ and a distribution $P$ on $\mathcal{X}$, we wish to solve

$$\mathop{\mathrm{minimize}}_{\theta \in \Theta} \quad f_P(\theta) := \mathbb{E}_P[F(\theta, X)] = \int F(\theta, x)\, dP(x), \qquad (1)$$

where $\Theta \subset \mathbb{R}^d$ is a closed convex set. The geometry of the underlying constraint set $\Theta$ and the structure of the subgradients $\partial F(\cdot, x)$ of course impact the performance of algorithms for problem (1). Thus, while stochastic subgradient methods are a de facto choice for their simplicity and scalability [22, 19, 5], their convergence guarantees depend on the $\ell_2$-diameter of $\Theta$ and $\partial F(\cdot, x)$, so that for non-Euclidean geometries (e.g. when $\Theta$ is an $\ell_1$-ball) one can obtain better convergence guarantees using mirror descent, dual averaging, or the more recent adaptive gradient methods [18, 19, 4, 20, 13]. We revisit these ideas and precisely quantify optimal rates and gaps between the methods.

Our main contribution is to show that the geometry of the constraint set and gradients interact in a way completely analogous to Donoho et al.'s classical characterization of optimal estimation in Gaussian sequence models [10], where one observes a vector $\theta \in \Theta$ corrupted by Gaussian noise, $Y = \theta + N(0, \sigma^2 I)$. For such problems, one can consider linear estimators, $\hat\theta = AY$ for a matrix $A \in \mathbb{R}^{d \times d}$, or potentially non-linear estimators, $\hat\theta = \Phi(Y)$ where $\Phi : \mathbb{R}^d \to \Theta$. When $\Theta$ is quadratically convex, meaning the set $\Theta^2 := \{(\theta_j^2)_{j \le d} \mid \theta \in \Theta\}$ is convex, Donoho et al.
show there exists a minimax rate optimal linear estimator; conversely, there are non-quadratically convex $\Theta$ for which minimax rate optimal estimators $\hat\theta$ must be nonlinear in $Y$.

To build our analogy, we turn to stochastic and online convex optimization. Consider Nesterov's dual averaging, where for a strongly convex $h : \Theta \to \mathbb{R}$, one iterates for $k = 1, 2, \ldots$ by receiving a (random) $X_k \in \mathcal{X}$, choosing $g_k \in \partial F(\theta_k, X_k)$, and for a stepsize $\alpha_k > 0$ updating

$$\theta_{k+1} := \mathop{\mathrm{argmin}}_{\theta \in \Theta} \Big\{ \sum_{i \le k} g_i^\top \theta + \frac{1}{\alpha_k} h(\theta) \Big\}. \qquad (2)$$

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

When $\Theta = \mathbb{R}^d$ and $h$ is Euclidean, that is, $h(\theta) = \frac{1}{2}\theta^\top A \theta$ for some $A \succ 0$, the updates are linear in the observed gradients $g_i$, as $\theta_{k+1} = -\alpha_k A^{-1} \sum_{i \le k} g_i$. Drawing a parallel between $\Phi$ in the Gaussian sequence model [10] and $h$ in dual averaging (2), a natural conjecture is that a dichotomy similar to that for the Gaussian sequence model holds for stochastic and online convex optimization: if $\Theta$ is quadratically convex, there is a Euclidean $h$ (yielding "linear" updates) that is minimax rate optimal, while there exist non-quadratically convex $\Theta$ for which Euclidean distance-generating $h$ are arbitrarily suboptimal. We show that this analogy holds almost completely, with the caveat that we fully characterize minimax rates when the subgradients lie in a quadratically convex set or a weighted $\ell_r$ ball, $r \ge 1$. (This issue does not arise for the Gaussian sequence model, as the observations $Y$ come from a fixed distribution, so there is no notion of alternative norms on $Y$.)

More precisely, we prove that for compact, convex, quadratically convex, orthosymmetric constraint sets $\Theta$, subgradient methods with a fixed diagonal re-scaling are minimax rate optimal.
This guarantees that for a large collection of constraints (e.g. $\ell_2$ balls, weighted $\ell_p$-bodies for $p \ge 2$, or hyperrectangles) a diagonal re-scaling suffices. This is important in machine learning problems of appropriate geometry, for example in linear classification problems where the data (features) are sparse, so that using a dense predictor $\theta$ is natural [13, 14]. Conversely, we show that if the constraint set $\Theta$ is a (scaled) $\ell_p$ ball, $1 \le p < 2$, then, considering unconstrained updates (2), the regret of the best method of linear type can be $\sqrt{d/\log d}$ times larger than the minimax rate. As part of this, we provide new information-theoretic lower bounds on optimization for general convex constraints $\Theta$. In contrast to the frequent practice in the literature of comparing regret upper bounds (prima facie illogical), we demonstrate that the gap between linear and non-linear methods must hold.

Our conclusions relate to the growing literature on adaptive algorithms [3, 13, 21, 9]. Our results effectively prescribe that these adaptive algorithms are useful when the constraint set is quadratically convex, as then there is a minimax optimal diagonal pre-conditioner. Even more, different sets suggest different regularizers. For example, when the constraint set is a hyperrectangle, AdaGrad has regret at most $\sqrt{2}$ times that of the best post-hoc pre-conditioner, which we show is minimax optimal, while (non-adaptive) standard gradient methods can be $\sqrt{d}$ suboptimal on such problems. Conversely, our results strongly recommend against those methods for non-quadratically convex constraint sets. Our results thus clarify and explicate the work of Wilson et al.
[28]: when the geometry of $\Theta$ and $\partial F$ is appropriate for adaptive gradient methods or Euclidean algorithms, one should use them; when it is not (the constraints $\Theta$ are not quadratically convex), one should not.

Notation. $d$ always refers to dimension and $n$ to sample size. For a norm $\gamma : \mathbb{R}^d \to \mathbb{R}_+$, $B_\gamma(x_0, r) := \{x \mid \gamma(x - x_0) \le r\}$ denotes the ball of radius $r$ around $x_0$ in the norm $\gamma$. For $p \in [1, \infty]$ we use the shorthand $B_p(x_0, r) := B_{\|\cdot\|_p}(x_0, r)$. The dual norm of $\gamma$ is $\gamma^*(z) = \sup_{\gamma(x) \le 1} x^\top z$. For $\theta, \tau \in \mathbb{R}^d$, we abuse notation and define $\theta^2 := (\theta_j^2)_{j \le d}$, $|\theta| := (|\theta_j|)_{j \le d}$, $\theta/\tau := (\theta_j/\tau_j)_{j \le d}$ and $\theta \odot \tau := (\theta_j \tau_j)_{j \le d}$. The function $h : \mathbb{R}^d \to \mathbb{R}$ denotes a distance generating function, i.e. a function strongly convex with respect to a norm $\|\cdot\|$; $D_h(x, y) = h(x) - h(y) - \nabla h(y)^\top (x - y)$ denotes the Bregman divergence, where $h$ is strongly convex with respect to $\|\cdot\|$ if and only if $D_h(x, y) \ge \frac{1}{2}\|x - y\|^2$. The subdifferential of $F(\cdot, x)$ at $\theta$ is $\partial_\theta F(\theta, x)$. $I(X; Y)$ is the (Shannon) mutual information between random variables $X$ and $Y$. For a set $\Omega$ and $f, g : \Omega \to \mathbb{R}$, we write $f \lesssim g$ if there exists a finite numerical constant $C$ such that $f(t) \le C g(t)$ for $t \in \Omega$, and $f \asymp g$ if $g \lesssim f \lesssim g$.

2 Preliminaries

We begin by defining the minimax framework in which we analyze procedures, review standard stochastic subgradient methods, and introduce the relevant geometric notions of convexity we require.

Minimax rate for convex stochastic optimization. We measure the complexity of families of problems in two familiar ways: stochastic minimax complexity and regret [18, 1, 6]. Let $\Theta \subset \mathbb{R}^d$ be a closed convex set, $\mathcal{X}$ a sample space, and $\mathcal{F}$ a collection of functions $F : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$.
For a collection $\mathcal{P}$ of distributions over $\mathcal{X}$, recall (1) that $f_P(\theta) := \int F(\theta, x)\, dP(x)$ is the expected loss of the point $\theta$. Then the minimax stochastic risk is

$$M^S_n(\Theta, \mathcal{F}, \mathcal{P}) := \inf_{\hat\theta_n} \sup_{F \in \mathcal{F}} \sup_{P \in \mathcal{P}} \mathbb{E}\Big[ f_P\big(\hat\theta_n(X_1^n)\big) - \inf_{\theta \in \Theta} f_P(\theta) \Big],$$

where the expectation is taken over $X_1^n \stackrel{\mathrm{iid}}{\sim} P$ and the infimum ranges over all measurable functions $\hat\theta_n$ of $X_1^n$. A related notion is the average minimax regret, which instead takes a supremum over samples $x_1^n \in \mathcal{X}^n$ and measures losses instantaneously. In this case, an algorithm consists of a sequence of decisions $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_n$, where $\hat\theta_i$ is chosen conditional on the samples $x_1^{i-1}$, so that

$$M^R_n(\Theta, \mathcal{F}, \mathcal{X}) := \inf_{\hat\theta_{1:n}} \sup_{F \in \mathcal{F},\, x_1^n \in \mathcal{X}^n,\, \theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \Big[ F\big(\hat\theta_i(x_1^{i-1}), x_i\big) - F(\theta, x_i) \Big].$$

In the regret case we may of course identify $x_i$ with individual functions $F$, so this corresponds to the standard (averaged) regret. In both of these definitions, we do not constrain the point estimates $\hat\theta$ to lie in the constraint set (in the language of learning theory, improper predictions), but in our cases this does not change regret by more than a constant factor. As online-to-batch conversions make clear [7], we always have $M^S_n \le M^R_n$; thus we typically provide lower bounds on $M^S_n$ and upper bounds on $M^R_n$.

We study functions whose continuity properties are specified by a norm $\gamma$ over $\mathbb{R}^d$, defining

$$\mathcal{F}_{\gamma, r} := \big\{ F : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R} \mid \text{for all } \theta \in \mathbb{R}^d,\ g \in \partial_\theta F(\theta, x),\ \gamma(g) \le r \big\}, \qquad (3)$$

which is equivalent to the Lipschitz condition $|F(\theta, x) - F(\theta', x)| \le r\, \gamma^*(\theta - \theta')$, where $\gamma^*$ is the dual norm to $\gamma$.
For a given norm $\gamma$ ($g$ as a mnemonic for gradient), we use the shorthands $M^R_n(\Theta, \gamma) := \sup_{\mathcal{X}} M^R_n(\Theta, \mathcal{F}_{\gamma,1}, \mathcal{X})$ and $M^S_n(\Theta, \gamma) := \sup_{\mathcal{P} \subset \mathcal{P}(\mathcal{X})} M^S_n(\Theta, \mathcal{F}_{\gamma,1}, \mathcal{P})$, as the Lipschitzian properties of $F$ in relation to $\Theta$ determine the minimax regret and risk.

Stochastic gradient methods, mirror descent, and regret. Let us briefly review the canonical algorithms for solving the problem (1) and their associated convergence guarantees. For an algorithm outputting points $\theta_1, \ldots, \theta_n$, the regret on the sequence $F(\cdot, x_i)$ with respect to a point $\theta$ is

$$\mathrm{Regret}_n(\theta) := \sum_{i=1}^n \big[ F(\theta_i, x_i) - F(\theta, x_i) \big].$$

Recalling the definition $D_h(\theta, \theta') = h(\theta) - h(\theta') - \nabla h(\theta')^\top (\theta - \theta')$ of the Bregman divergence, the mirror descent algorithm [18, 4] iteratively sets $g_i \in \partial_\theta F(\theta_i, x_i)$ and updates

$$\theta^{\mathrm{MD}}_{i+1} := \mathop{\mathrm{argmin}}_{\theta \in \Theta} \Big\{ g_i^\top \theta + \frac{1}{\alpha} D_h(\theta, \theta_i) \Big\}, \qquad (4)$$

where $\alpha > 0$ is a stepsize. When the function $h$ is 1-strongly convex with respect to a norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$, the iterates (4) and the iterates (2) of dual averaging satisfy (cf. [4, 6, 20])

$$\mathrm{Regret}_n(\theta) \le \frac{D_h(\theta, \theta_1)}{\alpha} + \frac{\alpha}{2} \sum_{i \le n} \|g_i\|_*^2 \quad \text{for any } \theta \in \Theta. \qquad (5)$$

One recovers the classical stochastic gradient method with the choice $h(\theta) = \frac{1}{2}\|\theta\|_2^2$, which is strongly convex with respect to the $\ell_2$-norm, while the $p$-norm algorithms [15, 23], defined for $1 < p \le 2$, use $h(\theta) = \frac{1}{2(p-1)}\|\theta\|_p^2$, which is strongly convex with respect to the $\ell_p$-norm.

As we previously stated in our definitions of minimax risk and regret, we do not constrain the point estimates to lie in the constraint set $\Theta$, which is equivalent to taking $\Theta = \mathbb{R}^d$ in the updates (4) or (2).
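As a concrete illustration of the updates above, the following sketch (our own minimal example, not code from the paper; the data, comparator, and stepsize are hypothetical choices) runs unconstrained mirror descent with a diagonal quadratic $h$, so that each step is a diagonally preconditioned gradient step, and checks the regret bound (5) numerically:

```python
import numpy as np

def diag_mirror_descent(grads, lam, alpha):
    """Unconstrained mirror descent (4) with h(theta) = 0.5 * theta^T diag(lam) theta.

    With this quadratic h the update is linear in the observed gradients:
    theta_{i+1} = theta_i - alpha * grads[i] / lam, a diagonally preconditioned
    gradient step (a "method of linear type" in the text's terminology).
    """
    d = grads.shape[1]
    theta = np.zeros(d)  # theta_1 = 0
    iterates = []
    for g in grads:
        iterates.append(theta.copy())
        theta = theta - alpha * g / lam
    return np.array(iterates)

# Linear losses F(theta, x_i) = g_i^T theta with gradients in the l_inf ball
# and a comparator in the l_1 ball (both hypothetical choices).
rng = np.random.default_rng(0)
n, d = 200, 5
G = rng.uniform(-1, 1, size=(n, d))           # ||g_i||_inf <= 1
alpha = 1.0 / np.sqrt(n * d)
iterates = diag_mirror_descent(G, lam=np.ones(d), alpha=alpha)
theta_star = -np.sign(G.sum(axis=0)) / d      # feasible comparator, ||.||_1 <= 1
regret = float(np.sum(iterates * G) - G.sum(axis=0) @ theta_star)
# Bound (5) with h(theta) = 0.5 ||theta||_2^2 and theta_1 = 0:
bound = 0.5 * np.dot(theta_star, theta_star) / alpha + 0.5 * alpha * np.sum(G ** 2)
```

For this quadratic $h$ the bound (5) holds deterministically for any gradient sequence, so `regret <= bound` on every run.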
The regret bound (5) still holds when considering unconstrained updates, whenever $\theta \in \Theta$, and the regret of the algorithm with respect to a constraint set $\Theta$ is simply $\sup_{\theta \in \Theta} \mathrm{Regret}_n(\theta)$. Even with unconstrained updates, the form (5) still captures small regret for all common constraint sets $\Theta$ [23]. To make this clear, let $\Theta \subset \mathbb{R}^d$ be the $\ell_1$-ball; taking $h(\theta) = \frac{1}{2(p-1)}\|\theta\|_p^2$ for $p = 1 + \frac{1}{\log(2d)}$, $q = \frac{p}{p-1} = 1 + \log(2d)$, and $\theta_1 = 0$ guarantees

$$\sup_{\|\theta\|_1 \le 1} \mathrm{Regret}_n(\theta) \le \sup_{\|\theta\|_1 \le 1} \frac{h(\theta)}{\alpha} + \frac{\alpha}{2} \sum_{i \le n} \|g_i\|_q^2 \le \frac{\log(2d)}{2\alpha} + \frac{e^2 \alpha}{2} \sum_{i=1}^n \|g_i\|_\infty^2.$$

Assuming $\|g_i\|_\infty \le 1$ for all $i$ and taking $\alpha \propto \frac{1}{e}\sqrt{\log(2d)/n}$ gives the familiar $O(1) \cdot \sqrt{n \log d}$ regret.

We frequently focus on distance generating functions of the form $h(\theta) = \frac{1}{2}\theta^\top A \theta$ for a fixed positive semi-definite matrix $A$. For an arbitrary such $A$, we refer to these methods as Euclidean gradient methods, and for a diagonal $A$ as diagonally-scaled gradient methods. It is important to note that, in this case, the mirror descent update is the stochastic gradient update with $A^{-1} g$, where $g$ is a stochastic subgradient. We shall refer to all such methods as methods of linear type.

Quadratic convexity and orthosymmetry. For a set $\Theta$, we let $\Theta^2 := \{\theta^2, \theta \in \Theta\}$ denote its square. The set $\Theta$ is quadratically convex if $\Theta^2$ is convex; typical examples of quadratically convex sets are weighted $\ell_p$ bodies for $p \ge 2$ or hyperrectangles. We let $\mathrm{QHull}(\Theta)$ be the quadratic convex hull of $\Theta$, meaning the smallest convex and quadratically convex set containing $\Theta$. The set $\Theta \subset \mathbb{R}^d$ is orthosymmetric if it is invariant to flipping the signs of any coordinate: formally, if $\theta \in \Theta$ then $s \in \{\pm 1\}^d$ implies $(s_j \theta_j)_{j \le d} \in \Theta$.
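A quick numeric check of quadratic convexity (our own illustration, not an argument from the paper): the squared set of the $\ell_p$ ball is $\{u \ge 0 : \|u\|_{p/2} \le 1\}$, which fails to be convex exactly when $p < 2$. Testing whether the midpoint of two elements of the squared set stays in the squared set makes the dichotomy concrete:

```python
import numpy as np

def in_squared_lp_ball(u, p):
    """Is the nonnegative vector u in the squared set {theta^2 : ||theta||_p <= 1}?"""
    return float(np.sum(np.sqrt(u) ** p)) <= 1.0 + 1e-12

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = 0.5 * (e1 + e2)  # midpoint of two points of the squared set

# e1 and e2 are squares of unit vectors, hence lie in Theta^2 for every p.
# For p = 1 the midpoint escapes Theta^2, so B_1 is not quadratically convex;
# for p = 2 it stays inside, consistent with B_2 being quadratically convex.
print(in_squared_lp_ball(mid, p=1))  # False: sqrt(0.5) + sqrt(0.5) ~ 1.414 > 1
print(in_squared_lp_ball(mid, p=2))  # True:  0.5 + 0.5 = 1
```

The same computation with any $p \ge 2$ keeps the midpoint inside, matching the claim that weighted $\ell_p$ bodies for $p \ge 2$ are quadratically convex.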
We extend this notion to norms: we say that a norm $\gamma$ is orthosymmetric if $\gamma(g) = \gamma(|g|)$ for all $g$. Similarly, we say that a norm $\gamma$ is quadratically convex if $\gamma$ induces a quadratically convex unit ball.

3 Minimax optimality and quadratically convex constraint sets

We begin our contributions by considering quadratically convex constraint sets, providing lower bounds on the minimax risk and matching upper bounds on the minimax regret of convex optimization over such sets. We further show that these are attained by diagonally-scaled gradient methods. While the analogy with the Gaussian sequence model is nearly complete, in distinction to the work of Donoho et al. (where results depend solely on the constraints $\Theta$), our results necessarily depend on the geometry of the subdifferential. Consequently, we distinguish throughout this section between quadratically and non-quadratically convex geometry of the gradients. To set the stage and preview our contributions, we begin our study with the familiar case of $\Theta = B_p(0, 1)$ and norm $\gamma = \|\cdot\|_r$ on the subgradients, with $p \in [2, \infty]$ (so that $\Theta$ is quadratically convex) and $r \ge 1$. We then turn to arbitrary quadratically convex constraint sets and first show results in the case of general quadratically convex norms on the subgradients. We conclude the section by proving that, when the subgradients do not lie in a quadratically convex set but lie in a weighted $\ell_r$ ball (for $r \in [1, 2]$), diagonally-scaled gradient methods are still minimax rate optimal.

3.1 A warm-up: p-norm constraint sets for $p \ge 2$

While the results for the basic case, in which the constraints $\Theta$ are an $\ell_p$-ball and the gradients belong to a different $\ell_r$-ball, are special cases of the theorems to come, the proofs (appendicized) are simpler and provide intuition for the later results. We distinguish between two cases depending on the value of $r$ in the gradient norm.
The case $r \in [1, 2]$ corresponds roughly to "sparse" gradients, while the case $r \ge 2$ corresponds to harder problems with dense gradients. We provide information-theoretic proofs of the following two results in Appendices B.1 and B.2, respectively.

Proposition 1 (Sparse gradients). Let $\Theta = B_p(0, 1)$ with $p \ge 2$ and $\gamma(\cdot) = \|\cdot\|_r$ where $r \in [1, 2]$. Then

$$1 \wedge \frac{d^{\frac{1}{2} - \frac{1}{p}}}{\sqrt{n}} \;\lesssim\; M^S_n(\Theta, \gamma) \;\le\; M^R_n(\Theta, \gamma) \;\lesssim\; 1 \wedge \frac{d^{\frac{1}{2} - \frac{1}{p}}}{\sqrt{n}}.$$

Proposition 2 (Dense gradients). Let $\Theta = B_p(0, 1)$ with $p \ge 2$ and $\gamma(\cdot) = \|\cdot\|_r$ with $r \ge 2$. Then

$$1 \wedge \frac{d^{\frac{1}{2} - \frac{1}{r}}\, d^{\frac{1}{2} - \frac{1}{p}}}{\sqrt{n}} \;\lesssim\; M^S_n(\Theta, \gamma) \;\le\; M^R_n(\Theta, \gamma) \;\lesssim\; 1 \wedge \frac{d^{\frac{1}{2} - \frac{1}{r}}\, d^{\frac{1}{2} - \frac{1}{p}}}{\sqrt{n}}.$$

In both cases, the stochastic gradient method achieves the regret upper bound via a straightforward optimization of the regret bound (5) with $h(\theta) = \frac{1}{2}\|\theta\|_2^2$. That is, a method of linear type is optimal.

3.2 General quadratically convex constraints

We now turn to the more general case in which $\Theta$ is an arbitrary convex, compact, quadratically convex and orthosymmetric set. We combine two techniques to develop the results. The first essentially builds out of the ideas of Donoho et al. [10] in Gaussian sequence estimation, which show that the largest hyperrectangle in $\Theta$ governs the performance of linear estimators; this gives us a lower bound. The key second technique is in the upper bound, where a strong duality result holds because of the quadratic convexity of $\Theta$, allowing us to prove minimax optimality of diagonally scaled Euclidean procedures. As in the previous section, we divide our analysis into cases depending on whether the gradient norm $\gamma$ is quadratically convex or not (the analogs of $r \lessgtr 2$ in Propositions 1 and 2). We begin with the lower bound, which relies on rectangular structures in the primal $\Theta$ and dual gradient spaces.
For the proposition, we use a specialization of the function families (3) to rectangular sets, where for $M \in \mathbb{R}^d_+$ we define

$$\mathcal{F}_M := \Big\{ F : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R} \;\Big|\; \text{for all } \theta \in \mathbb{R}^d,\ g \in \partial_\theta F(\theta, x),\ \max_{j \le d} \frac{|g_j|}{M_j} \le 1 \Big\}.$$

Proposition 3 (Duchi et al. [14], Proposition 1). Let $M \in \mathbb{R}^d_+$ and $\mathcal{F}_M$ be as above. Let $a \in \mathbb{R}^d_+$ and assume the hyperrectangular containment $\prod_{j=1}^d [-a_j, a_j] \subset \Theta$. Then

$$M^S_n(\Theta, \mathcal{F}_M) \ge \frac{1}{8\sqrt{n \log 3}} \sum_{j=1}^d M_j a_j.$$

We begin the analysis of the general case by studying the rates of diagonally-scaled gradient methods.

3.2.1 Diagonal re-scaling in gradient methods

As we discuss in Section 2, diagonally-scaled gradient methods (componentwise re-scaling of the subgradients) are equivalent to using $h_\Lambda(\theta) := \frac{1}{2}\theta^\top \Lambda \theta$ for $\Lambda = \mathrm{diag}(\lambda) \succeq 0$ in the mirror descent update (4). In this case, for any norm $\gamma$ on the gradients, the minimax regret bound (5) becomes

$$\sup_{\theta \in \Theta} \mathrm{Regret}_{n,\Lambda}(\theta) \le \frac{1}{2n}\Big[ \sup_{\theta \in \Theta} \theta^\top \Lambda \theta + \sum_{i \le n} g_i^\top \Lambda^{-1} g_i \Big] \le \frac{1}{2n}\Big[ \sup_{\theta \in \Theta} \theta^\top \Lambda \theta + n \sup_{g \in B_\gamma(0,1)} g^\top \Lambda^{-1} g \Big].$$

The rightmost term of course upper bounds the minimax regret, so we may take an infimum over $\Lambda$, yielding

$$M^R_n(\Theta, \gamma) \le \frac{1}{2n} \inf_{\lambda \succeq 0} \sup_{\theta \in \Theta} \sup_{g \in B_\gamma(0,1)} \Big[ \sum_{j \le d} \lambda_j \theta_j^2 + n \sum_{j \le d} \frac{1}{\lambda_j} g_j^2 \Big]. \qquad (6)$$

The regret bound (6) holds without assumptions on $\Theta$ or $\gamma$. However, in the case when $\Theta$ is quadratically convex, strong duality allows us to simplify this quantity:

Proposition 4. Let $V, \Theta \subset \mathbb{R}^d$ be convex, quadratically convex and compact sets. Then

$$\inf_{\lambda \succeq 0} \sup_{\theta \in \Theta,\, v \in V} \Big( \lambda^\top \theta^2 + \Big(\frac{1}{\lambda}\Big)^\top v^2 \Big) = \sup_{\theta \in \Theta,\, v \in V} \inf_{\lambda \succeq 0} \Big( \lambda^\top \theta^2 + \Big(\frac{1}{\lambda}\Big)^\top v^2 \Big).$$

Proof.
The quadratic convexity of the sets $\Theta$ and $V$ implies that a (weighted) squared $\ell_2$-norm becomes a linear functional when lifted to the squared sets $\Theta^2 := \{\theta^2 \mid \theta \in \Theta\}$ and $V^2$. Indeed, defining $J : \mathbb{R}^{2d}_+ \times \mathbb{R}^d_+ \to \mathbb{R}$, $J(\tau, w, \lambda) := \lambda^\top \tau + (\frac{1}{\lambda})^\top w$, the function $J$ is concave-convex: it is linear (a fortiori concave) in $(\tau, w)$ and convex in $\lambda$. Thus, using that the set $\{\lambda \in \mathbb{R}^d_+\}$ is convex and $\Theta^2 \times V^2$ is convex compact (because $\Theta$ and $V$ are quadratically convex and compact), Sion's minimax theorem [25] implies

$$\inf_{\lambda \succeq 0} \sup_{\tau \in \Theta^2,\, w \in V^2} \Big( \lambda^\top \tau + \Big(\frac{1}{\lambda}\Big)^\top w \Big) = \sup_{\tau \in \Theta^2,\, w \in V^2} \inf_{\lambda \succeq 0} \Big( \lambda^\top \tau + \Big(\frac{1}{\lambda}\Big)^\top w \Big).$$

Replacing $\tau$ with $\theta^2$ and $w$ with $v^2$ gives the result.

Proposition 4 provides a powerful hammer for diagonally scaled Euclidean optimization algorithms, as we can choose an optimal scaling for any fixed pair $\theta, g$, taking a worst case over such pairs:

Corollary 1. Let $\Theta$ be a convex, quadratically convex, compact set. Then

$$M^R_n(\Theta, \gamma) \le \frac{1}{\sqrt{n}} \sup_{g \in \mathrm{QHull}(B_\gamma(0,1)),\, \theta \in \Theta} \theta^\top g,$$

and diagonally-scaled gradient methods achieve this regret.

Proof. We upper bound the minimax regret (6) by taking a supremum over the quadratic hull $g \in \mathrm{QHull}(B_\gamma(0, 1))$, which contains $B_\gamma(0, 1)$. Using that for $a, b > 0$, $\inf_{\lambda > 0} \{\lambda a + b/\lambda\} = 2\sqrt{ab}$, and applying Proposition 4 gives the proof.

The corollary allows us to provide concrete upper and lower bounds on minimax risk and regret, with the results differing slightly based on whether the gradient norms are quadratically convex.

3.2.2 Orthosymmetric and quadratically convex gradient norms

We now provide lower bounds on minimax risk complementary to Corollary 1, focusing first on the case that the gradient norm $\gamma$ is quadratically convex.

Assumption A1.
The norm $\gamma$ is orthosymmetric and quadratically convex, meaning $\gamma(s \odot v) = \gamma(v)$ for all $s \in \{\pm 1\}^d$ and $B_\gamma(0, 1)$ is quadratically convex.

With this, we have the following theorem, which shows that diagonally-scaled gradient methods are minimax rate optimal, and that the constants are sharp up to a factor of 9, whenever the gradient norms are quadratically convex. While the constant 9 is looser than that Donoho et al. [10] provide for Gaussian sequence models, this theorem highlights the essential structural similarity between the sequence model case and stochastic optimization methods.

Theorem 1. Let Assumption A1 hold and let $\Theta$ be quadratically convex, orthosymmetric, and compact. Then

$$\frac{1}{8\sqrt{\log 3}} \cdot \frac{1}{\sqrt{n}} \sup_{\theta \in \Theta} \gamma^*(\theta) \;\le\; M^S_n(\Theta, \gamma) \;\le\; M^R_n(\Theta, \gamma) \;\le\; \frac{1}{\sqrt{n}} \sup_{\theta \in \Theta} \gamma^*(\theta).$$

There exists $\lambda \in \mathbb{R}^d_+$ such that diagonally-scaled gradient methods with $\lambda$ achieve this rate. We present the proof in Appendix C.1.

3.2.3 Arbitrary gradient norms

When the norm $\gamma$ on the gradients defines a non-quadratically convex norm ball $B_\gamma(0, 1)$ (for example, when the gradients belong to an $\ell_r$-norm ball for $r \in [1, 2]$), our results become slightly less general. Nonetheless, when $\gamma$ is a weighted $\ell_r$-norm (for $r \in [1, 2]$), diagonally-scaled gradient methods are minimax rate optimal, as Corollary 2 will show; when the norms $\gamma$ are arbitrary we have a slightly more complex result.

Theorem 2. Let $\Theta$ be an orthosymmetric, quadratically convex, convex and compact set and $\gamma$ an arbitrary norm. Recall the definition $(\frac{\theta}{\gamma(e_{(\cdot)})})_j = \theta_j / \gamma(e_j)$. Then for any $k \in \mathbb{N}$,

$$\frac{1}{8\sqrt{n \log 3}} \Big(1 - \frac{k}{n \log 3}\Big) \sup_{\theta \in \Theta,\, \|\theta\|_0 \le k} \Big\| \frac{\theta}{\gamma(e_{(\cdot)})} \Big\|_2 \;\le\; M^S_n(\Theta, \gamma) \;\le\; M^R_n(\Theta, \gamma) \;\le\; \frac{1}{\sqrt{n}} \sup_{\theta \in \Theta} \sup_{g \in \mathrm{QHull}(B_\gamma(0,1))} \theta^\top g. \qquad (7)$$

Corollary 1 gives the upper bound in the theorem.
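The optimization over $\lambda$ behind these diagonal-scaling results is explicit: for each fixed pair $(\theta, g)$, the identity $\inf_{\lambda > 0}\{\lambda a + b/\lambda\} = 2\sqrt{ab}$ applied coordinate-wise to the bound (6) yields the optimal pre-conditioner in closed form. The following numeric check (a hypothetical illustration with our own random $\theta$ and $g$, not the paper's proof) verifies it:

```python
import numpy as np

def diag_bound(lam, theta, g, n):
    """Right-hand side of the regret bound (6) for a fixed diagonal scaling lam
    and a fixed worst-case pair (theta, g)."""
    return (np.sum(lam * theta ** 2) + n * np.sum(g ** 2 / lam)) / (2 * n)

rng = np.random.default_rng(1)
n, d = 100, 4
theta = rng.uniform(0.1, 1.0, d)
g = rng.uniform(0.1, 1.0, d)

# Coordinate-wise, inf_{lam_j > 0} lam_j*theta_j^2 + n*g_j^2/lam_j is attained
# at lam_j = sqrt(n)*|g_j|/|theta_j|, and the resulting value of (6) collapses
# to theta^T g / sqrt(n), matching Corollary 1's form.
lam_star = np.sqrt(n) * g / theta
closed_form = np.dot(theta, g) / np.sqrt(n)

# Scan multiplicative perturbations of lam_star: none should beat the closed form.
best_on_grid = min(
    diag_bound(t * lam_star, theta, g, n) for t in np.linspace(0.2, 5.0, 200)
)
```

Since `diag_bound(t * lam_star, ...)` equals $\frac{1}{2}(t + 1/t)\,\theta^\top g/\sqrt{n}$, the grid minimum sits at $t = 1$ and matches the closed form.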
The lower bound consists of an application of Assouad's method [2], but, in parallel to the warm-up examples, we construct well-separated functions with "sparse" gradients. See Appendix C.2 for a proof.

We can develop a corollary of this result when the norm $\gamma$ is a weighted $\ell_r$ norm (for $r \in [1, 2]$). While these do not induce quadratically convex norm balls, meaning the results of the previous section do not apply, the previous theorem still guarantees that diagonally-scaled gradient methods are minimax rate optimal.

Corollary 2. Let the conditions of Theorem 2 hold and assume that $\gamma(g) = \|\beta \odot g\|_r$ with $r \in [1, 2]$, $\beta_j > 0$ and $(\beta \odot g)_j = \beta_j g_j$. Then for $n \ge 2d$,

$$\frac{1}{16\sqrt{n}} \sup_{\theta \in \Theta} \Big\| \frac{\theta}{\gamma(e_{(\cdot)})} \Big\|_2 \;\le\; M^S_n(\Theta, \gamma) \;\le\; M^R_n(\Theta, \gamma) \;\le\; \frac{1}{\sqrt{n}} \sup_{\theta \in \Theta} \Big\| \frac{\theta}{\gamma(e_{(\cdot)})} \Big\|_2.$$

There exists $\lambda \in \mathbb{R}^d_+$ such that diagonally-scaled gradient methods with $\lambda$ achieve this rate. A minor modification of Theorem 2 gives the lower bound, while we obtain the upper bound by noting that the quadratic hull of a weighted-$\ell_r$ norm ball for $r \in [1, 2]$ is the weighted-$\ell_2$ norm ball. The dual norm of $\gamma(g) = \|\beta \odot g\|_2$ being $\gamma^*(z) = \|z/\beta\|_2$, the upper bound holds by duality. See Appendix C.3 for the (short) precise proof.

Theorem 1 and Corollary 2 show that for a large collection of norms $\gamma$ on the gradients, diagonally-scaled gradient methods are minimax rate optimal. Arguing that diagonally-scaled gradient methods are minimax rate optimal when $\gamma$ is neither a weighted-$\ell_r$ norm nor induces a quadratically convex unit ball remains an open question, though weighted-$\ell_r$ norms for $r \in [1, \infty]$ cover the majority of practical applications of stochastic gradient methods.

We conclude this section by generalizing our results to constraint sets that are rotations of orthosymmetric and quadratically convex sets. This is for example the case when features are sparse in an appropriate basis (e.g.
wavelets [17]). Unsurprisingly, methods of linear type retain their optimality properties.

Corollary 3. Let $\Theta_0$ be a compact, orthosymmetric, convex and quadratically convex set. Let $U \in \mathbb{O}_d(\mathbb{R})$ be a rotation matrix and $\Theta := U\Theta_0 = \{U\theta \mid \theta \in \Theta_0\}$. Consider the collection

$$\mathcal{F} := \{ F : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R} \mid \forall x \in \mathcal{X},\ \forall \theta \in \mathbb{R}^d,\ \forall g \in \partial_\theta F(\theta, x),\ \gamma(U^\top g) \le 1 \}.$$

A method of linear type is minimax rate optimal for the pair $(\Theta, \mathcal{F})$. We present the proof in Appendix C.4.

4 Beyond quadratic convexity: the necessity of non-linear methods

For $\Theta \subset \mathbb{R}^d$ quadratically convex, the results in Section 3 show that methods of linear type achieve optimal rates of convergence. When the constraint set is not quadratically convex, it is unclear whether methods of linear type are sufficient to achieve optimal rates. As we now show, they are not: we exhibit a collection of problem instances where the constraint set is orthosymmetric, compact, and convex but not quadratically convex. On such problems, the constraint set has substantial consequences; for some non-quadratically convex sets $\Theta$, methods of linear type (e.g. the stochastic gradient method) can be minimax rate-optimal, while for other constraint sets, all methods of linear type must have regret at least a factor $\sqrt{d/\log d}$ worse than the minimax optimal rate, which (non-linear) mirror descent with an appropriate distance generating function achieves.

To construct these problem instances, we turn to simple non-quadratically convex constraint sets: $\ell_p$ balls for $p \in [1, 2]$. We measure subgradient norms in the dual $\ell_{p^*}$ norm, $p^* = \frac{p}{p-1}$. Our analysis consists of two steps: we first prove sharp minimax rates on these problem instances and show that mirror descent with the right (non-linear) distance generating function is minimax rate optimal. These results extend those of Agarwal et al.
[1], who provide matching lower and upper bounds for $p \ge 1 + c$ for a fixed numerical constant $c > 0$. In contrast, we prove sharp minimax rates for all $p \ge 1$. To precisely characterize the gap between linear and non-linear methods, we show that for any linear pre-conditioner, we can exhibit functions for which the regret of Euclidean gradient methods is nearly the simple upper regret bound of standard gradient methods, Eq. (5) with $h(\theta) = \frac{1}{2}\|\theta\|_2^2$. Thus, when $p$ is very close to 2 (nearly quadratically convex), the gap remains within a constant factor, whereas when $p$ is close to 1, the gap can be as large as $\sqrt{d/\log d}$.

4.1 Minimax rates for p-norm constraint sets, $p \in [1, 2]$

For $p \in [1, 2]$, we consider the constraint set $\Theta = B_p(0, 1)$ and bound gradients with the norm $\gamma = \|\cdot\|_{p^*}$. We begin by proving sharp minimax rates on this collection of problems and show that, in these cases, non-linear mirror descent is minimax optimal.

Theorem 3. Let $p \in [1, 2]$, $\Theta = B_p(0, 1)$ and $\gamma = \|\cdot\|_{p^*}$.

(i) If $1 \le p \le 1 + 1/\log(2d)$, then

$$1 \wedge \sqrt{\frac{\log(2d)}{n}} \;\lesssim\; M^S_n(\Theta, \gamma) \;\le\; M^R_n(\Theta, \gamma) \;\lesssim\; 1 \wedge \sqrt{\frac{\log(2d)}{n}}.$$

Mirror descent (4) with $h(\theta) := \frac{1}{2(a-1)}\|\theta\|_a^2$ for $a = 1 + \frac{1}{\log(2d)}$ achieves the optimal rate.

(ii) If $1 + 1/\log(2d) < p \le 2$, then

$$1 \wedge \sqrt{\frac{1}{n(p-1)}} \;\lesssim\; M^S_n(\Theta, \gamma) \;\le\; M^R_n(\Theta, \gamma) \;\lesssim\; 1 \wedge \sqrt{\frac{1}{n(p-1)}}.$$

Mirror descent with $h(\theta) := \frac{1}{2(p-1)}\|\theta\|_p^2$ achieves the optimal rate.

To prove the theorem, we upper bound the regret of mirror descent with norm-based distance generating functions (cf. [24, Corollary 2.18]), which follows immediately from the regret bound (5).

Proposition 5. Let $\Theta$ be closed convex, $\gamma$ a norm, and $1 < a \le 2$, $a^* = \frac{a}{a-1}$.
Mirror descent with distance generating function $h(\theta) := \frac{1}{2(a-1)}\|\theta\|_a^2$ and stepsize $\alpha = \frac{\sup_{\theta \in \Theta} \|\theta - \theta_1\|_a}{\sqrt{n}\, \sup_{g \in B_\gamma(0,1)} \|g\|_{a^*}}$ achieves regret

$$M^R_n(\Theta, \gamma) \le \frac{\sup_{\theta \in \Theta} \|\theta\|_a \, \sup_{g \in B_\gamma(0,1)} \|g\|_{a^*}}{\sqrt{n(a-1)}}.$$

We present the full proof of Theorem 3 in Appendix D.1. We obtain the lower bound with the familiar reduction from estimation to testing and Assouad's method (see Appendix A.2).

4.2 Exhibiting hard problems for Euclidean gradient methods

Theorem 3 shows that (non-linear) mirror descent methods are minimax rate-optimal for $\ell_p$-ball constraint sets, $p \in [1, 2]$, with gradients contained in the corresponding dual $\ell_{p^*}$-norm ball ($p^* = \frac{p}{p-1}$). For such problems, standard subgradient methods achieve worst-case (average) regret $O(d^{1/2 - 1/p^*}/\sqrt{n})$. This is sharp: in the next theorem, we show that for any method of linear type, we can construct a sequence of (linear) functions such that the method's regret is at least this familiar upper bound of standard subgradient methods, precisely quantifying the gap between linear and non-linear methods for this problem class.

Theorem 4. Let $\mathrm{Regret}_{n,A}(\theta) = \sum_{i=1}^n g_i^\top(\theta_i - \theta)$ denote the regret of the (Euclidean) online mirror descent method with distance generating function $h_A(\theta) = \frac{1}{2}\theta^\top A\theta$ for linear functions $F_i(\theta) = g_i^\top \theta$. For any $A \succeq 0$ and $p \in [1, 2]$ with $q = \frac{p}{p-1}$, there exists a sequence of vectors $g_i \in \mathbb{R}^d$, $\|g_i\|_q \le 1$, and a point $\theta \in \mathbb{R}^d$ with $\|\theta\|_p \le 1$ such that

$$\mathrm{Regret}_{n,A}(\theta) \ge \frac{1}{2} \min\Big\{ \frac{n}{2},\ \sqrt{2n} \cdot d^{1/2 - 1/q} \Big\}.$$

We provide the proof in Appendix D.2. These results explicitly exhibit a gap between methods of linear type and non-linear mirror descent methods for this problem class.
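To see the scaling of this gap concretely, one can tabulate the two rates (an illustrative comparison with constants suppressed; the choice of $n$ and $d$ values is ours): at $p = 1$ (so $q = \infty$) the Theorem 4 lower bound for any Euclidean method grows like $\sqrt{2n}\cdot\sqrt{d}$, while non-linear mirror descent achieves roughly $\sqrt{n \log(2d)}$ by Theorem 3(i).

```python
import numpy as np

def euclidean_lower_bound(n, d):
    # Theorem 4 at p = 1, q = infinity: 0.5 * min(n/2, sqrt(2n) * d^(1/2)).
    return 0.5 * min(n / 2, np.sqrt(2 * n) * np.sqrt(d))

def mirror_descent_upper_bound(n, d):
    # Theorem 3(i) rate, constants suppressed: sqrt(n * log(2d)).
    return np.sqrt(n * np.log(2 * d))

n = 10 ** 6  # n large enough that the sqrt(2n) branch of the min is active
for d in (10, 1_000, 100_000):
    gap = euclidean_lower_bound(n, d) / mirror_descent_upper_bound(n, d)
    print(f"d={d:>6}: gap ~ {gap:.1f}  (sqrt(d/log d) = {np.sqrt(d / np.log(d)):.1f})")
```

The printed ratios track $\sqrt{d/\log d}$ up to a constant, the advertised separation between linear and non-linear methods.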
In contrast to the frequent practice in the literature of simply comparing regret upper bounds (prima facie illogical), we demonstrate that the gap must indeed hold.

In combination with Theorem 4, Proposition 5 precisely characterizes the gap between linear and non-linear mirror descent on these problems for all values of $p \in [1, 2]$. Indeed, when $p = 1$, for any pre-conditioner $A$, there exists a problem on which Euclidean gradient methods have regret at least $\Omega(1)\sqrt{d/n}$. On the same problem, non-linear mirror descent has regret at most $O(1)\sqrt{\log d / n}$, showing the advertised $\sqrt{d/\log d}$ gap. When $p \ge 2 - 1/\log d$ (so that $\Theta$ is nearly quadratically convex), the gap reduces to at most a constant factor.

5 The need for adaptive methods

We have so far demonstrated that diagonal re-scaling is sufficient to achieve minimax optimal rates for problems over quadratically convex constraint sets. In practice, however, we often do not know the geometry of the problem in advance, precluding selection of the optimal linear pre-conditioner. To address this problem, adaptive gradient methods choose, at each step, a (usually diagonal) matrix $\Lambda_i$ conditional on the subgradients observed thus far, $\{g_l\}_{l \le i}$. The algorithm then updates the iterate based on the distance generating function $h_i(\theta) := \frac{1}{2}\theta^\top \Lambda_i \theta$. In this section, we present a problem instance showing that when the "scale" of the subgradients varies across dimensions, adaptive gradient methods are crucial for achieving low regret. While an optimal pre-conditioner exists, if we do not assume knowledge of the geometry in advance, AdaGrad [13] achieves the minimax optimal regret while standard (non-adaptive) subgradient methods can be $\sqrt{d}$-suboptimal on the same problem.

We consider the following setting: $\Theta = B_\infty(0, 1)$ and $\|g\|_\sigma = \|\sigma \odot g\|_2$, for an arbitrary $\sigma \in \mathbb{R}^d$, $\sigma > 0$.
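As a concrete reference point, a minimal diagonal update in the spirit of AdaGrad [13] can be sketched as follows (our simplified illustration, not the exact algorithm of [13]; for the box $\Theta = B_\infty(0, 1)$ and a diagonal metric, the weighted projection is coordinate-wise clipping):

```python
import numpy as np

def adagrad_step(theta, g, sq_sum, alpha=1.0, eps=1e-12):
    """One diagonal-AdaGrad-style step for linear losses over Θ = B_∞(0, 1).

    sq_sum accumulates per-coordinate squared gradients, so the implicit
    distance generating function is h_i(θ) = ½ θ^T Λ_i θ with
    Λ_i = diag(sqrt(sq_sum)).
    """
    sq_sum = sq_sum + g ** 2
    theta = theta - alpha * g / (np.sqrt(sq_sum) + eps)
    # for a diagonal metric, projection onto the box is coordinate-wise clipping
    return np.clip(theta, -1.0, 1.0), sq_sum
```

Each coordinate's effective stepsize shrinks with that coordinate's accumulated gradient magnitude, which is what lets the method adapt to per-dimension scales $\sigma_j$ without knowing them in advance.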
Intuitively, $\sigma_j$ corresponds to the "scale" of the $j$-th dimension. On this problem, a straightforward optimization of the regret bound (5) guarantees that stochastic gradient methods achieve regret $\sqrt{dn}/\min_j \sigma_j$. We exhibit a problem instance (in Appendix E) such that, for any stepsize $\alpha$, online gradient descent attains this worst-case regret.

Theorem 5. Let $\mathrm{Regret}_{n,\alpha}(\theta) = \sum_{i \le n} g_i^\top(\theta_i - \theta)$ denote the regret of the online gradient descent method with stepsize $\alpha \ge 0$ for linear functions $F_i(\theta) = g_i^\top \theta$. For any choice of $\alpha \ge 0$ and $\sigma > 0$, there exists a sequence of vectors $\{g_i\}_{i \le n} \subset \mathbb{R}^d$ with $\|g_i\|_\sigma \le 1$ and a point $\theta \in \Theta$ such that
$$\mathrm{Regret}_{n,\alpha}(\theta) \ge \frac{1}{2}\min\left\{ \frac{dn}{2\|\sigma\|_1},\ \frac{\sqrt{2dn}}{\min_{j \le d} \sigma_j} \right\}.$$

In contrast, AdaGrad [13] achieves regret $\sqrt{n}\|1/\sigma\|_2$, demonstrating a suboptimality gap as large as $\sqrt{d}$ for some choices of $\sigma$. Indeed, let $\mathrm{Regret}_{n,\mathrm{AdaGrad}}(\theta)$ be the regret of AdaGrad. Then
$$\mathrm{Regret}_{n,\mathrm{AdaGrad}}(\theta) \le 2\sqrt{2}\sum_{j \le d}\sqrt{\sum_{i \le n} g_{i,j}^2}$$
(see [13, Corollary 6]), and by Cauchy-Schwarz,
$$\sum_{j \le d}\sqrt{\sum_{i \le n} g_{i,j}^2} = \sum_{j \le d}\frac{1}{\sigma_j}\sqrt{\sum_{i \le n} \sigma_j^2 g_{i,j}^2} \le \|1/\sigma\|_2 \sqrt{\sum_{i \le n} \|\sigma \odot g_i\|_2^2} \le \sqrt{n}\,\|1/\sigma\|_2.$$

To concretely consider different scales across dimensions, we choose $\sigma_j = j$. Theorem 5 then guarantees that there exists a collection of linear functions such that stochastic gradient methods suffer regret $\Omega(1)\sqrt{dn}$. Given that $\|1/\sigma\|_2 \le \sqrt{\zeta(2)} = \pi/\sqrt{6}$, AdaGrad achieves regret $O(1)\sqrt{n}$, amounting to a suboptimality gap of order $\sqrt{d}$ and exhibiting the need for adaptivity. This $\sqrt{d}$ gap is also the largest possible over subgradient methods, which may achieve regret $\sqrt{d\sum_{i \le n}\|g_i\|_2^2} \le \sqrt{d}\sum_{j \le d}\sqrt{\sum_{i \le n} g_{i,j}^2}$ for $\Theta = B_\infty(0, 1)$. Finally, we note in passing that AdaGrad is minimax optimal on this class of problems via a straightforward application of Theorem 1.

6 Discussion

In this paper, we provide concrete recommendations for when one should use adaptive, mirror, or standard gradient methods depending on the geometry of the problem. While we emphasize the importance of adaptivity, the picture is not fully complete: for example, in the case of quadratically convex constraint sets, while the best diagonal pre-conditioner achieves optimal rates, the extent to which adaptive gradient algorithms find this optimal pre-conditioner remains an open question. Another avenue to explore involves the many flavors of adaptivity: while the minimax framework assumes knowledge of the problem setting (e.g., a bound on the domain or the gradient norms), such parameters are often unknown to the practitioner. To what extent can adaptivity mitigate this and achieve optimal rates, and is minimax (i.e., worst-case) optimality truly the right measure of performance? Finally, we close with a parting message about the value and costs of adaptive and related methods. One should turn to adaptive gradient methods (at most) in settings where methods of linear type are optimal. It is as our mothers told us when we were children: if you want steak, don't order chicken.

Acknowledgments This work was supported by NSF-CAREER Award 1553086, ONR-YIP N00014-19-1-2288, and the Stanford DAWN Project. We thank Aditya Grover, Annie Marsden and Hongseok Namkoong for valuable comments on the draft, as well as Quentin Guignard for pointing us to the Banach-Mazur distance for Theorem 4.

References

[1] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.

[2] P.
Assouad. Deux remarques sur l'estimation. Comptes Rendus des Séances de l'Académie des Sciences, Série I, 296(23):1021–1024, 1983.

[3] P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems 20, 2007.

[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

[5] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[7] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

[8] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley, 2006.

[9] A. Cutkosky and T. Sarlos. Matrix-free preconditioning in online learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

[10] D. L. Donoho, R. C. Liu, and B. MacGibbon. Minimax risk over hyperrectangles, and implications. Annals of Statistics, 18(3):1416–1437, 1990.

[11] J. C. Duchi. Introductory lectures on stochastic convex optimization. In The Mathematics of Data, IAS/Park City Mathematics Series. American Mathematical Society, 2018.

[12] J. C. Duchi. Information theory and statistics. Lecture Notes for Statistics 311/EE 377, Stanford University, 2019. URL http://web.stanford.edu/class/stats311/lecture-notes.pdf. Accessed May 2019.

[13] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[14] J. C. Duchi, M. I. Jordan, and H. B. McMahan.
Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems 26, 2013.

[15] C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265–299, 2003.

[16] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, New York, 1993.

[17] S. Mallat. A Wavelet Tour of Signal Processing: The Sparse Way (Third Edition). Academic Press, 2008.

[18] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[19] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[20] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):261–283, 2009.

[21] F. Orabona and K. Crammer. New adaptive algorithms for online classification. In Advances in Neural Information Processing Systems 23, 2010.

[22] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[23] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

[24] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[25] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.

[26] R. Vershynin. Lectures in geometric functional analysis. Unpublished manuscript, 2009. URL https://www.math.uci.edu/~rvershyn/papers/GFA-book.pdf.

[27] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

[28] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht.
The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30, 2017.

[29] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.