{"title": "Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back", "book": "Advances in Neural Information Processing Systems", "page_first": 3576, "page_last": 3584, "abstract": "In stochastic convex optimization the goal is to minimize a convex function $F(x) \\doteq \\E_{f\\sim D}[f(x)]$ over a convex set $\\K \\subset \\R^d$ where $D$ is some unknown distribution and each $f(\\cdot)$ in the support of $D$ is convex over $\\K$. The optimization is based on i.i.d.~samples $f^1,f^2,\\ldots,f^n$ from $D$. A common approach to such problems is empirical risk minimization (ERM) that optimizes $F_S(x) \\doteq \\frac{1}{n}\\sum_{i\\leq n} f^i(x)$. Here we consider the question of how many samples are necessary for ERM to succeed and the closely related question of uniform convergence of $F_S$ to $F$ over $\\K$. We demonstrate that in the standard $\\ell_p/\\ell_q$ setting of Lipschitz-bounded functions over a $\\K$ of bounded radius, ERM requires sample size that scales linearly with the dimension $d$. This nearly matches standard upper bounds and improves on $\\Omega(\\log d)$ dependence proved for $\\ell_2/\\ell_2$ setting in (Shalev-Shwartz et al. 2009). In stark contrast, these problems can be solved using dimension-independent number of samples for $\\ell_2/\\ell_2$ setting and $\\log d$ dependence for $\\ell_1/\\ell_\\infty$ setting using other approaches. 
We also demonstrate that for a more general class of range-bounded (but not Lipschitz-bounded) stochastic convex programs an even stronger gap appears already in dimension 2.", "full_text": "Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back*

Vitaly Feldman
IBM Research – Almaden

Abstract

In stochastic convex optimization the goal is to minimize a convex function F(x) ≐ E_{f∼D}[f(x)] over a convex set K ⊂ ℝ^d, where D is some unknown distribution and each f(·) in the support of D is convex over K. The optimization is commonly based on i.i.d. samples f^1, f^2, . . . , f^n from D. A standard approach to such problems is empirical risk minimization (ERM), which optimizes F_S(x) ≐ (1/n)∑_{i≤n} f^i(x). Here we consider the question of how many samples are necessary for ERM to succeed, and the closely related question of uniform convergence of F_S to F over K. We demonstrate that in the standard ℓ_p/ℓ_q setting of Lipschitz-bounded functions over a K of bounded radius, ERM requires sample size that scales linearly with the dimension d. This nearly matches standard upper bounds and improves on the Ω(log d) dependence proved for the ℓ_2/ℓ_2 setting in [18]. In stark contrast, these problems can be solved using a dimension-independent number of samples for the ℓ_2/ℓ_2 setting and with log d dependence for the ℓ_1/ℓ_∞ setting using other approaches.
We further show that our lower bound applies even if the functions in the support of D are smooth and efficiently computable, and even if an ℓ_1 regularization term is added.
Finally, we demonstrate that for a more general class of bounded-range (but not Lipschitz-bounded) stochastic convex programs an infinite gap appears already in dimension 2.

*See [9] for the full version of this work.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 Introduction

Numerous central problems in machine learning, statistics and operations research are special cases of stochastic optimization from i.i.d. data samples. In this problem the goal is to optimize the value of the expected objective function F(x) ≐ E_{f∼D}[f(x)] over some set K given i.i.d. samples f^1, f^2, . . . , f^n of f. For example, in supervised learning the set K consists of hypothesis functions from Z to Y, and each sample is an example described by a pair (z, y) ∈ (Z, Y). For some fixed loss function L : Y × Y → ℝ, an example (z, y) defines a function from K to ℝ given by f_{(z,y)}(h) = L(h(z), y). The goal is to find a hypothesis h that (approximately) minimizes the expected loss relative to some distribution P over examples: E_{(z,y)∼P}[L(h(z), y)] = E_{(z,y)∼P}[f_{(z,y)}(h)].

Here we are interested in stochastic convex optimization (SCO) problems in which K is some convex subset of ℝ^d and each function in the support of D is convex over K. The importance of this setting stems from the fact that such problems can be solved efficiently via a large variety of known techniques. Therefore, in many applications, even if the original optimization problem is not convex, it is replaced by a convex relaxation.

A classic and widely-used approach to solving stochastic optimization problems is empirical risk minimization (ERM), also referred to as stochastic average approximation (SAA) in the optimization literature. In this approach, given a set of samples S = (f^1, f^2, . . . , f^n), the empirical objective function F_S(x) ≐ (1/n)∑_{i≤n} f^i(x) is optimized (sometimes with an additional regularization term such as λ‖x‖² for some λ > 0). The question we address here is the number of samples required for this approach to work distribution-independently. More specifically, for some fixed convex body K and fixed set of convex functions F over K, what is the smallest number of samples n such that for every probability distribution D supported on F, any algorithm that minimizes F_S given n i.i.d. samples from D will produce an ε-optimal solution x̂ to the problem (namely, F(x̂) ≤ min_{x∈K} F(x) + ε) with probability at least 1 − δ? We will refer to this number as the sample complexity of ERM for ε-optimizing F over K (we will fix δ = 1/2 for now).

The sample complexity of ERM for ε-optimizing F over K is lower bounded by the sample complexity of ε-optimizing F over K, that is, the number of samples necessary for any algorithm to find an ε-optimal solution. On the other hand, it is upper bounded by the number of samples that ensures uniform convergence of F_S to F. Namely, if with probability ≥ 1 − δ, for all x ∈ K, |F_S(x) − F(x)| ≤ ε/2, then, clearly, any algorithm based on ERM will succeed. As a result, ERM and uniform convergence are the primary tools for analysis of the sample complexity of learning problems and are the key subject of study in statistical learning theory. Fundamental results in VC theory imply that in some settings, such as binary classification and least-squares regression, uniform convergence is also a necessary condition for learnability (e.g.
[23, 17]) and therefore the three measures of sample complexity mentioned above nearly coincide.

In the context of stochastic convex optimization, the study of the sample complexity of ERM and uniform convergence was initiated in a groundbreaking work of Shalev-Shwartz, Shamir, Srebro and Sridharan [18]. They demonstrated that the relationships between these notions of sample complexity are substantially more delicate even in the most well-studied settings of SCO. Specifically, let K be a unit ℓ_2 ball and F be the set of all convex sub-differentiable functions with Lipschitz constant relative to ℓ_2 bounded by 1 or, equivalently, ‖∇f(x)‖_2 ≤ 1 for all x ∈ K. Then, known algorithms for SCO imply that the sample complexity of this problem is O(1/ε²), often expressed as a 1/√n rate of convergence (e.g. [14, 17]). On the other hand, Shalev-Shwartz et al. [18] show² that the sample complexity of ERM for solving this problem with ε = 1/2 is Ω(log d). The only known upper bound for the sample complexity of ERM is Õ(d/ε²) and relies only on the uniform convergence of Lipschitz-bounded functions [21, 18].

As can be seen from this discussion, the work of Shalev-Shwartz et al. [18] still leaves a major gap between known bounds on the sample complexity of ERM (and also uniform convergence) for this basic Lipschitz-bounded ℓ_2/ℓ_2 setup. Another natural question is whether the gap is present in the popular ℓ_1/ℓ_∞ setup. In this setup K is a unit ℓ_1 ball (or in some cases a simplex) and ‖∇f(x)‖_∞ ≤ 1 for all x ∈ K. The sample complexity of SCO in this setup is Θ(log d/ε²) (e.g. [14, 17]) and therefore even an appropriately modified lower bound in [18] does not imply any gap. More generally, the choice of norm can have a major impact on the relationship between these sample complexities and hence needs to be treated carefully. For example, for the (reversed) ℓ_∞/ℓ_1 setting the sample complexity of the problem is Θ(d/ε²) (e.g. [10]) and nearly coincides with the number of samples sufficient for uniform convergence.

1.1 Overview of Results

In this work we substantially strengthen the lower bound in [18], proving that a linear dependence on the dimension d is necessary for ERM (and, consequently, uniform convergence). We then extend the lower bound to all ℓ_p/ℓ_q setups and examine several related questions. Finally, we examine a more general setting of bounded-range SCO (that is, |f(x)| ≤ 1 for all x ∈ K). While the sample complexity of this setting is still low (for example, Õ(1/ε²) when K is an ℓ_2 ball) and efficient algorithms are known, we show that ERM might require an infinite number of samples already for d = 2.

Our work implies that in SCO, even optimization algorithms that exactly minimize the empirical objective function can produce solutions with generalization error that is much larger than the generalization error of solutions obtained via some standard approaches. Another, somewhat counterintuitive, conclusion from our lower bounds is that, from the point of view of generalization of ERM and uniform convergence, convexity does not reduce the sample complexity in the worst case.

²The dependence on d is not stated explicitly but follows immediately from their analysis.

Basic construction: Our basic construction is fairly simple and its analysis is inspired by the technique in [18]. It is based on functions of the form max{1/2, max_{v∈V} ⟨v, x⟩}. Note that the maximum operator preserves both convexity and the Lipschitz bound (relative to any norm).
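As a toy illustration of this form of function (a sketch of ours, not part of the paper: it uses a Hadamard-based W of size d = 16 in place of the 2^{d/6} packing set, and the names `g` and `W_bar` are hypothetical), the following snippet checks the key gap numerically: at a point w̄ corresponding to an unobserved w, every sampled function evaluates to 1/2, yet the expected objective there is 3/4.

```python
import numpy as np

# Sylvester/Hadamard construction: d = 16 pairwise-orthogonal sign vectors,
# so any two distinct rows w1, w2 satisfy <w1, w2> = 0 <= d/2.
H = np.array([[1.0]])
for _ in range(4):
    H = np.block([[H, H], [H, -H]])
d = H.shape[0]              # 16
W_bar = H / np.sqrt(d)      # unit-norm versions of the rows

def g(V, x):
    """g_V(x) = max(1/2, max_{w in V} <w_bar, x>) for an index set V."""
    vals = W_bar[list(V)] @ x if V else np.array([0.5])
    return float(max(0.5, vals.max()))

# True objective at w_bar: w lands in a uniformly random V with prob. 1/2,
# so F(w_bar) = (1/2) * 1 + (1/2) * (1/2) = 3/4, while min F = 1/2 (origin).
w_bar = W_bar[0]
F_w = 0.5 * g({0}, w_bar) + 0.5 * g(set(), w_bar)
print(F_w)  # 0.75

# Condition on the event (probability > 1/2 when n <= d/6) that w = W[0] is
# never observed: then w_bar minimizes F_S (an ERM), yet F(w_bar) = 3/4.
samples = [{1, 3, 5}, {2, 4}]   # index sets V_i avoiding index 0
F_S = np.mean([g(V, w_bar) for V in samples])
print(F_S)  # 0.5
```

Orthogonality of the Hadamard rows makes the cross terms vanish exactly; the actual construction only needs the weaker property ⟨u, v⟩ ≤ d/2.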
See Figure 1 for an illustration of such a function for d = 2.

Figure 1: Basic construction for d = 2.

The distribution over the sets V that define such functions is uniform over all subsets of some set of vectors W of size 2^{d/6} such that for any two distinct u, v ∈ W, ⟨u, v⟩ ≤ 1/2. Equivalently, each element of W is included in V with probability 1/2, independently of the other elements of W. This implies that if the number of samples is less than d/6 then, with probability > 1/2, at least one of the vectors in W (say w) will not be observed in any of the samples. This implies that F_S can be minimized while maximizing ⟨w, x⟩ (the maximum over the unit ℓ_2 ball is attained at w). Note that a function randomly chosen from our distribution includes the term ⟨w, x⟩ in the maximum operator with probability 1/2. Therefore the value of the expected function F at w is 3/4, whereas the minimum of F is 1/2. In particular, there exists an ERM algorithm with generalization error of at least 1/4. The details of the construction appear in Sec. 3.1 and Thm. 3.3 gives the formal statement of the lower bound. We also show that, by scaling the construction appropriately, we can obtain the same lower bound for any ℓ_p/ℓ_q setup with 1/p + 1/q = 1 (see Thm. 3.4).

Low-complexity construction: The basic construction relies on functions that require 2^{d/6} bits to describe and exponential time to compute. Most applications of SCO use efficiently computable functions and therefore it is natural to ask whether the lower bound still holds for such functions. To answer this question we describe a construction based on a set of functions where each function requires just log d bits to describe (there are at most d/2 functions in the support of the distribution) and each function can be computed in O(d) time.
To achieve this we will use a W that consists of (scaled) codewords of an asymptotically good and efficiently computable binary error-correcting code [12, 22]. The functions are defined in a similar way, but the additional structure of the code allows using at most d/2 subsets of W to define the functions. Further details of the construction appear in Section 4.

Smoothness: The use of the maximum operator results in functions that are highly non-smooth (that is, their gradient is not Lipschitz-bounded), whereas the construction in [18] uses smooth functions. Smoothness plays a crucial role in many algorithms for convex optimization (see [5] for examples). It reduces the sample complexity of SCO in the ℓ_2/ℓ_2 setup to O(1/ε) when the smoothness parameter is a constant (e.g. [14, 17]). Therefore it is natural to ask whether our strong lower bound holds for smooth functions as well. We describe a modification of our construction that proves a similar lower bound in the smooth case (with generalization error of 1/128). The main idea is to replace each linear function ⟨v, x⟩ with some smooth function ν(⟨v, x⟩), guaranteeing that for different vectors v_1, v_2 ∈ W and every x ∈ K, only one of ν(⟨v_1, x⟩) and ν(⟨v_2, x⟩) can be non-zero. This makes it easy to control the smoothness of max_{v∈V} ν(⟨v, x⟩). See Figure 2 for an illustration of a function on which the construction is based (for d = 2). The details of this construction appear in Sec. 3.2 and the formal statement in Thm. 3.6.

Figure 2: Construction using 1-smooth functions for d = 2.

ℓ_1-regularization: Another important contribution of [18] is the demonstration of the important role that strong convexity plays for generalization in SCO: minimization of F_S(x) + λR(x) ensures that ERM will have low generalization error whenever R(x) is strongly convex (for a sufficiently large λ). This result is based on the proof that ERM of a strongly convex Lipschitz function is uniformly replace-one stable, together with the connection between such stability and generalization shown in [4] (see also [19] for a detailed treatment of the relationship between generalization and stability). It is natural to ask whether other approaches to regularization ensure generalization. We demonstrate that for the commonly used ℓ_1 regularization the answer is negative. We prove this using a simple modification of our lower bound construction: we shift the functions to the positive orthant, where the regularization term λ‖x‖_1 is just a linear function. We then subtract this linear function from each function in our construction, thereby balancing out the regularization (while maintaining convexity and Lipschitz-boundedness). The details of this construction appear in Sec. 3.3 (see Thm. 3.7).

Dependence on accuracy: For simplicity and convenience we have ignored the dependence on the accuracy ε, Lipschitz bound L and radius R of K in our lower bounds. It is easy to see that this more general setting can be reduced to the case we consider here (Lipschitz bound and radius equal to 1) with accuracy parameter ε′ = ε/(LR). We generalize our lower bound to this setting and prove that Ω(d/ε′²) samples are necessary for uniform convergence and Ω(d/ε′) samples are necessary for generalization of ERM.
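The cancellation underlying the ℓ_1-regularization argument can be checked numerically. The sketch below is ours (the helper names `h_V` and `h_lambda` are hypothetical, and `h_V` uses the smoothed components of Sec. 3.2): on the positive orthant, λ‖x‖_1 = λ⟨1̄, x⟩, so adding the penalty to the shifted function recovers the unregularized objective exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
lam = 1.0 / np.sqrt(d)            # lambda <= 1/sqrt(d), as in the paper

def h_V(x, W_bar):
    """Smoothed component sum: h_V(x) = sum_w nu(<w_bar, x> - 7/8)."""
    a = W_bar @ x - 7.0 / 8.0
    return float(np.sum(np.where(a > 0, a * a, 0.0)))

def h_lambda(x, W_bar):
    """Shifted, regularization-balancing variant h^lambda_V of eq. (3)."""
    return h_V(x - 1.0 / np.sqrt(d), W_bar) - lam * np.sum(x)

W_bar = rng.choice([-1.0, 1.0], size=(5, d)) / np.sqrt(d)

# On the positive orthant lam * ||x||_1 equals lam * <1, x>, so the penalty
# cancels: the regularized objective is just h_V at a shifted point.
x = rng.random(d)                 # x >= 0 coordinatewise
lhs = h_lambda(x, W_bar) + lam * np.sum(np.abs(x))
rhs = h_V(x - 1.0 / np.sqrt(d), W_bar)
print(abs(lhs - rhs) < 1e-12)     # True
```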
Note that the upper bound on the sample complexity of these settings is Õ(d/ε′²) and therefore the dependence on ε′ in our lower bound does not match the upper bound for ERM. Resolving this gap, or even proving any ω(d/ε′ + 1/ε′²) lower bound, is an interesting open problem. Additional details can be found in the full version.

Bounded-range SCO: Finally, we consider the more general class of bounded-range convex functions. Note that a Lipschitz bound of 1 and a bound of 1 on the radius of K imply a bound of 1 on the range (up to a constant shift, which does not affect the optimization problem). While this setting is not as well-studied, efficient algorithms for it are known. For example, the online algorithm in a recent work of Rakhlin and Sridharan [16], together with standard online-to-batch conversion arguments [6], implies that the sample complexity of this problem is Õ(1/ε²) for any K that is an ℓ_2 ball (of any radius). For general convex bodies K, the problem can be solved via random-walk-based approaches [3, 10] or an adaptation of the center-of-gravity method given in [10]. Here we show that for this setting ERM might completely fail already for K being the unit 2-dimensional ball. The construction is based on ideas similar to those we used in the smooth case and is formally described in the full version.

2 Preliminaries

For an integer n ≥ 1 let [n] ≐ {1, . . . , n}. Random variables are denoted by bold letters, e.g., f. Given p ∈ [1, ∞] we denote the ball of radius R > 0 in the ℓ_p norm by B^d_p(R), and the unit ball by B^d_p. For a convex body (i.e., compact convex set with nonempty interior) K ⊆ ℝ^d, we consider problems of the form

min_K(F_D) ≐ min_{x∈K} { F_D(x) ≐ E_{f∼D}[f(x)] },

where f is a random variable defined over some set of convex, sub-differentiable functions F on K and distributed according to some unknown probability distribution D. We denote F* = min_K(F_D). For an approximation parameter ε > 0, the goal is to find x ∈ K such that F_D(x) ≤ F* + ε, and we call any such x an ε-optimal solution. For an n-tuple of functions S = (f^1, . . . , f^n) we denote F_S ≐ (1/n)∑_{i∈[n]} f^i.

We say that a point x̂ is an empirical risk minimum for an n-tuple S of functions over K if F_S(x̂) = min_K(F_S). In some cases there are many points that minimize F_S, and in this case we refer to a specific algorithm that selects one of the minimizers of F_S as an empirical risk minimizer. To make this explicit we refer to the output of such a minimizer by x̂(S).

Given x ∈ K and a convex function f, we denote by ∇f(x) ∈ ∂f(x) an arbitrary selection of a subgradient. Let us briefly recall some important classes of convex functions. Let p ∈ [1, ∞] and q = p* ≐ 1/(1 − 1/p). We say that a subdifferentiable convex function f : K → ℝ is in the class

• F(K, B) of B-bounded-range functions if for all x ∈ K, |f(x)| ≤ B;
• F^0_p(K, L) of L-Lipschitz continuous functions w.r.t. ℓ_p if for all x, y ∈ K, |f(x) − f(y)| ≤ L‖x − y‖_p;
• F^1_p(K, σ) of functions with σ-Lipschitz continuous gradient w.r.t. ℓ_p if for all x, y ∈ K, ‖∇f(x) − ∇f(y)‖_q ≤ σ‖x − y‖_p.

We will omit p from the notation when p = 2. Omitted proofs can be found in the full version [9].

3 Lower Bounds for Lipschitz-Bounded SCO

In this section we present our main lower bounds for SCO of Lipschitz-bounded convex functions. For comparison purposes we start by formally stating some known bounds on the sample complexity of solving such problems. The following uniform convergence bounds can be easily derived from the standard covering number argument (e.g. [21, 18]).

Theorem 3.1. For p ∈ [1, ∞], let K ⊆ B^d_p(R) and let D be any distribution supported on functions L-Lipschitz on K relative to ℓ_p (not necessarily convex). Then, for every ε, δ > 0 and n ≥ n_1 = O( d · (LR)² · log(dLR/(εδ)) / ε² ),

Pr_{S∼D^n}[∃x ∈ K, |F_D(x) − F_S(x)| ≥ ε] ≤ δ.

The following upper bounds on the sample complexity of Lipschitz-bounded SCO can be obtained from several known algorithms [14, 18] (see [17] for a textbook exposition for p = 2).

Theorem 3.2. For p ∈ [1, 2], let K ⊆ B^d_p(R). Then there is an algorithm A_p that, given ε, δ > 0 and n = n_p(d, R, L, ε, δ) i.i.d. samples from any distribution D supported on F^0_p(K, L), outputs an ε-optimal solution to F_D over K with probability ≥ 1 − δ. For p ∈ (1, 2], n_p = O((LR/ε)² · log(1/δ)) and for p = 1, n_p = O((LR/ε)² · log d · log(1/δ)).

Stronger results are known under additional assumptions on smoothness and/or strong convexity (e.g. [14, 15, 20, 1]).

3.1 Non-smooth construction

We will start with a simpler lower bound for non-smooth functions. For simplicity, we will also restrict R = L = 1.
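To make the object under study concrete, here is what a generic empirical risk minimizer in this normalized setting (R = L = 1, ℓ_2 ball) can look like: a minimal projected-subgradient sketch of ours, not an algorithm from the paper; `erm_subgradient`, the 1/√t step size, and the linear toy losses are our own illustrative choices.

```python
import numpy as np

def erm_subgradient(subgrads, d, steps=500):
    """Minimize F_S(x) = mean_i f_i(x) over the unit l2 ball by projected
    subgradient descent. `subgrads` maps x to the subgradients g_i(x).
    A generic ERM/SAA sketch, not the paper's construction."""
    x = np.zeros(d)
    for t in range(1, steps + 1):
        g = np.mean(subgrads(x), axis=0)   # subgradient of F_S at x
        x -= g / np.sqrt(t)                # step size ~ 1/sqrt(t)
        norm = np.linalg.norm(x)
        if norm > 1.0:                     # project back onto B^d_2
            x /= norm
    return x

# Toy sample: f_i(x) = <a_i, x>, so F_S is linear and its minimizer over
# the unit ball is -a_bar / ||a_bar|| for a_bar the mean of the a_i.
rng = np.random.default_rng(2)
A = rng.normal(size=(10, 8))
x_hat = erm_subgradient(lambda x: A, d=8)
a_bar = A.mean(axis=0)
print(np.allclose(x_hat, -a_bar / np.linalg.norm(a_bar), atol=1e-2))
```

The lower bounds below concern exactly such exact (or near-exact) minimizers of F_S: the issue is not whether F_S can be minimized, but how well its minimizers generalize.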
Lower bounds for the general setting can be easily obtained from this case by scaling the domain and the desired accuracy.

We will need a set of vectors W ⊆ {−1, 1}^d with the following property: for any distinct w_1, w_2 ∈ W, ⟨w_1, w_2⟩ ≤ d/2. The Chernoff bound together with a standard packing argument imply that there exists a set W with this property of size ≥ e^{d/8} ≥ 2^{d/6}.

For any subset V of W we define a function

g_V(x) ≐ max{1/2, max_{w∈V} ⟨w̄, x⟩},   (1)

where w̄ ≐ w/‖w‖ = w/√d. See Figure 1 for an illustration. We first observe that g_V is convex and 1-Lipschitz (relative to ℓ_2). This immediately follows from ⟨w̄, x⟩ being convex and 1-Lipschitz for every w and g_V being the maximum of convex and 1-Lipschitz functions.

Theorem 3.3. Let K = B^d_2 and define H_2 ≐ {g_V | V ⊆ W} for g_V defined in eq. (1). Let D be the uniform distribution over H_2. Then for n ≤ d/6 and every set of samples S there exists an ERM x̂(S) such that

Pr_{S∼D^n}[F_D(x̂(S)) − F* ≥ 1/4] > 1/2.

Proof. We start by observing that the uniform distribution over H_2 is equivalent to picking the function g_V where V is obtained by including every element of W with probability 1/2, randomly and independently of all other elements. Further, by the properties of W, for every w ∈ W and V ⊆ W, g_V(w̄) = 1 if w ∈ V and g_V(w̄) = 1/2 otherwise. For g_V chosen randomly with respect to D, we have that w ∈ V with probability exactly 1/2. This implies that F_D(w̄) = 3/4.

Let S = (g_{V_1}, . . . , g_{V_n}) be the random samples. Observe that min_K(F_S) = 1/2 and F* = min_K(F_D) = 1/2 (the minimum is achieved at the origin 0̄). Now, if ∪_{i∈[n]} V_i ≠ W then let x̂(S) ≐ w̄ for any w ∈ W \\ ∪_{i∈[n]} V_i. Otherwise x̂(S) is defined to be the origin 0̄. Then, by the property of H_2 mentioned above, we have that for all i, g_{V_i}(x̂(S)) = 1/2 and hence F_S(x̂(S)) = 1/2. This means that x̂(S) is a minimizer of F_S.

Combining these statements, we get that if ∪_{i∈[n]} V_i ≠ W then there exists an ERM x̂(S) such that F_S(x̂(S)) = min_K(F_S) and F_D(x̂(S)) − F* = 1/4. Therefore, to prove the claim it suffices to show that for n ≤ d/6 we have

Pr_{S∼D^n}[∪_{i∈[n]} V_i ≠ W] > 1/2.

This easily follows from observing that for the uniform distribution over subsets of W, for every w ∈ W,

Pr_{S∼D^n}[w ∈ ∪_{i∈[n]} V_i] = 1 − 2^{−n},

and this event is independent of the inclusion of the other elements in ∪_{i∈[n]} V_i. Therefore

Pr_{S∼D^n}[∪_{i∈[n]} V_i = W] = (1 − 2^{−n})^{|W|} ≤ e^{−2^{−n}·2^{d/6}} ≤ e^{−1} < 1/2.

Other ℓ_p norms: We now observe that exactly the same approach can be used to extend this lower bound to the ℓ_p/ℓ_q setting. Specifically, for p ∈ [1, ∞] and q = p* we define

g_{p,V}(x) ≐ max{1/2, max_{w∈V} ⟨w, x⟩/d^{1/q}}.

It is easy to see that for every V ⊆ W, g_{p,V} ∈ F^0_p(B^d_p, 1). We can now use the same argument as before with the appropriate normalization factor for points in B^d_p. Namely, instead of w̄ for w ∈ W we consider the values of the minimized functions at w/d^{1/p} ∈ B^d_p. This gives the following generalization of Thm. 3.3.

Theorem 3.4. For every p ∈ [1, ∞], let K = B^d_p and define H_p ≐ {g_{p,V} | V ⊆ W}, and let D be the uniform distribution over H_p. Then for n ≤ d/6 and every set of samples S there exists an ERM x̂(S) such that

Pr_{S∼D^n}[F_D(x̂(S)) − F* ≥ 1/4] > 1/2.

3.2 Smoothness does not help

We now extend the lower bound to smooth functions. For simplicity we restrict our attention to ℓ_2, but analogous modifications can be made for other ℓ_p norms. The functions g_V that we used in the construction involve two maximum operators, each of which introduces non-smoothness. To deal with the maximum with 1/2 we simply replace the function max{1/2, ⟨w̄, x⟩} with a quadratically smoothed version (in the same way as the hinge loss is sometimes replaced with the modified Huber loss). To deal with the maximum over all w ∈ V, we show that it is possible to ensure that individual components do not \"interact\". That is, at every point x, the value, gradient and Hessian of at most one component function are non-zero (value, vector and matrix, respectively). This ensures that the maximum becomes addition and that Lipschitz/smoothness constants can be upper-bounded easily.

Formally, we define

ν(a) ≐ 0 if a ≤ 0, and ν(a) ≐ a² otherwise.

Now, for V ⊆ W, we define

h_V(x) ≐ ∑_{w∈V} ν(⟨w̄, x⟩ − 7/8).   (2)

See Figure 2 for an illustration. We first prove that h_V is 1/4-Lipschitz and 1-smooth.

Lemma 3.5. For every V ⊆ W and h_V defined in eq. (2) we have h_V ∈ F^0_2(B^d_2, 1/4) ∩ F^1_2(B^d_2, 1).

From here we can use the proof approach from Thm. 3.3 but with h_V in place of g_V.

Theorem 3.6. Let K = B^d_2 and define H ≐ {h_V | V ⊆ W} for h_V defined in eq. (2). Let D be the uniform distribution over H. Then for n ≤ d/6 and every set of samples S there exists an ERM x̂(S) such that

Pr_{S∼D^n}[F_D(x̂(S)) − F* ≥ 1/128] > 1/2.

3.3 ℓ_1 regularization does not help

Next we show that the lower bound holds even with an additional ℓ_1 regularization term λ‖x‖_1 for positive λ ≤ 1/√d. (Note that if λ > 1/√d then the resulting program is no longer 1-Lipschitz relative to ℓ_2. Any constant λ can be allowed in the ℓ_1/ℓ_∞ setup.) To achieve this we shift the construction to the positive orthant (that is, x such that x_i ≥ 0 for all i ∈ [d]). In this orthant the subgradient of the regularization term is simply λ1̄, where 1̄ is the all-1's vector. We can add a linear term to each function in our distribution that balances this term, thereby reducing the analysis to the non-regularized case. More formally, for V ⊆ W we define the following family of functions:

h^λ_V(x) ≐ h_V(x − 1̄/√d) − λ⟨1̄, x⟩.   (3)

Note that over B^d_2(2), h^λ_V(x) is L-Lipschitz for L ≤ 2(2 − 7/8) + λ√d ≤ 13/4. For a given λ ∈ (0, 1/√d], define H^λ ≐ {h^λ_V | V ⊆ W} for h^λ_V defined in eq. (3). We now state and prove this formally.

Theorem 3.7. Let K = B^d_2 and let D be the uniform distribution over H^λ. Then for n ≤ d/6 and every set of samples S there exists x̂(S) such that

• F_S(x̂(S)) = min_{x∈K}(F_S(x) + λ‖x‖_1);
• Pr_{S∼D^n}[F_D(x̂(S)) − F* ≥ 1/128] > 1/2.

4 Lower Bound for Low-Complexity Functions

We will now demonstrate that our lower bounds hold even if one restricts attention to functions that can be computed efficiently (in time polynomial in d). For this purpose we will rely on known constructions of binary linear error-correcting codes. We describe the construction for the non-smooth ℓ_2/ℓ_2 setting, but analogous versions of the other constructions can be obtained in the same way.

We start by briefly providing the necessary background on binary codes. For two vectors w_1, w_2 ∈ {±1}^d, let #≠(w_1, w_2) denote the Hamming distance between the two vectors. We say that a mapping G : {±1}^k → {±1}^d is a [d, k, r, T] binary error-correcting code if G has distance at least 2r + 1, G can be computed in time T, and there exists an algorithm that, for every w ∈ {±1}^d such that #≠(w, G(z)) ≤ r for some z ∈ {±1}^k, finds such z in time T (note that such z is unique). Given a [d, k, r, T] code G, for every j ∈ [k] we define a function

g_j(x) ≐ max{1 − r/(2d), max_{w∈W_j} ⟨w̄, x⟩},   (4)

where W_j ≐ {G(z) | z ∈ {±1}^k, z_j = 1}. As before, we note that g_j is convex and 1-Lipschitz (relative to ℓ_2).

We can now use any existing construction of efficient binary error-correcting codes to obtain a lower bound that uses only a small set of efficiently computable convex functions. Getting a lower bound with asymptotically optimal dependence on d requires that k = Ω(d) and r = Ω(d) (referred to as being asymptotically good).
The existence of efficiently computable and asymptotically good binary error-correcting codes was first shown by Justesen [12]. More recent work of Spielman [22] shows the existence of asymptotically good codes that can be encoded and decoded in O(d) time. In particular, for some constant ρ > 0, there exists a [d, d/2, ρ·d, O(d)] binary error-correcting code. As a corollary we obtain the following lower bound.

Corollary 4.1. Let G be an asymptotically good [d, d/2, ρ·d, O(d)] error-correcting code for a constant ρ > 0. Let K = B^d_2 and define H_G ≐ {g_j | j ∈ [d/2]} for g_j defined in eq. (4). Let D be the uniform distribution over H_G. Then for every x ∈ K, g_j(x) can be computed in time O(d). Further, for n ≤ d/4 and every set of samples S ∈ H^n_G there exists an ERM x̂(S) such that

F_D(x̂(S)) − F* ≥ ρ/4.

5 Discussion

Our work points to substantial limitations of the classic approach to understanding and analyzing generalization in the context of general SCO. Further, it implies that in order to understand how well solutions produced by an optimization algorithm generalize, it is necessary to examine the optimization algorithm itself. This is a challenging task for which we still have relatively few tools. Yet such understanding is also crucial for developing theory to guide the design of optimization algorithms that are used in machine learning applications.

One way to bypass our lower bounds is to use additional structural assumptions. For example, for generalized linear regression problems uniform convergence gives nearly optimal bounds on sample complexity [13].
One natural question is whether there exist more general classes of functions that capture most of the practically relevant SCO problems and enjoy dimension-independent (or scaling as log d) uniform convergence bounds.
An alternative approach is to bypass uniform convergence (and possibly also ERM) altogether. Among the large number of techniques that have been developed for ensuring generalization, the most general ones are based on notions of stability [4, 19]. However, known analyses based on stability often do not provide the strongest known generalization guarantees (e.g., high-probability bounds require very strong assumptions). Another issue is that we lack general algorithmic tools for ensuring stability of the output. Therefore many open problems remain, and significant progress is required to obtain a more comprehensive understanding of this approach. Some encouraging new developments in this area are the use of notions of stability derived from differential privacy [7, 8, 2] and the use of techniques for the analysis of convergence of convex optimization algorithms for proving stability [11].

Acknowledgements

I am grateful to Ken Clarkson, Sasha Rakhlin and Thomas Steinke for discussions and insightful comments related to this work.

References

[1] F. R. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In NIPS, pages 773–781, 2013.
[2] R. Bassily, K. Nissim, A. D. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In STOC, pages 1046–1059, 2016.
[3] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In COLT, pages 240–265, 2015.
[4] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.
[5] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[6] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[7] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statistical validity in adaptive data analysis. CoRR, abs/1411.2664, 2014. Extended abstract in STOC 2015.
[8] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Generalization in adaptive data analysis and holdout reuse. CoRR, abs/1506, 2015. Extended abstract in NIPS 2015.
[9] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. CoRR, abs/1608.04414, 2016. Extended abstract in NIPS 2016.
[10] V. Feldman, C. Guzman, and S. Vempala. Statistical query algorithms for mean vector estimation and stochastic convex optimization. CoRR, abs/1512.09170, 2015. Extended abstract in SODA 2017.
[11] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, pages 1225–1234, 2016.
[12] J. Justesen. Class of constructive asymptotically good algebraic codes. IEEE Trans. Inf. Theor., 18(5):652–656, 1972.
[13] S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008.
[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
[15] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[16] A. Rakhlin and K. Sridharan. Sequential probability assignment with binary alphabets and large classes of experts. CoRR, abs/1501.07340, 2015.
[17] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[18] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In COLT, 2009.
[19] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
[20] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, pages 71–79, 2013.
[21] A. Shapiro and A. Nemirovski. On complexity of stochastic programming problems. In V. Jeyakumar and A. M. Rubinov, editors, Continuous Optimization: Current Trends and Applications 144. Springer, 2005.
[22] D. Spielman. Linear-time encodable and decodable error-correcting codes. IEEE Transactions on Information Theory, 42(6):1723–1731, 1996.
[23] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
", "award": [], "sourceid": 1779, "authors": [{"given_name": "Vitaly", "family_name": "Feldman", "institution": "IBM Research - Almaden"}]}