Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Advances in Neural Information Processing Systems, pages 1623–1633

Alberto Bietti (Inria*), alberto.bietti@inria.fr
Julien Mairal (Inria*), julien.mairal@inria.fr

Abstract

Stochastic optimization algorithms with variance reduction have proven successful for minimizing large finite sums of functions. Unfortunately, these techniques are unable to deal with stochastic perturbations of input data, induced for example by data augmentation. In such cases, the objective is no longer a finite sum, and the main candidate for optimization is the stochastic gradient descent method (SGD). In this paper, we introduce a variance reduction approach for these settings when the objective is composite and strongly convex.
The convergence rate outperforms SGD, with a typically much smaller constant factor that depends only on the variance of gradient estimates due to perturbations on a single example.

1 Introduction

Many supervised machine learning problems can be cast as the minimization of an expected loss over a data distribution with respect to a vector $x$ in $\mathbb{R}^p$ of model parameters. When an infinite amount of data is available, stochastic optimization methods such as SGD or stochastic mirror descent algorithms, or their variants, are typically used (see [5, 11, 24, 34]). Nevertheless, when the dataset is finite, incremental methods based on variance reduction techniques (e.g., [2, 8, 15, 17, 18, 27, 29]) have proven to be significantly faster at solving the finite-sum problem

\[ \min_{x \in \mathbb{R}^p} \Big\{ F(x) := f(x) + h(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) + h(x) \Big\}, \tag{1} \]

where the functions $f_i$ are smooth and convex, and $h$ is a simple convex penalty that need not be differentiable, such as the $\ell_1$ norm. A classical setting is $f_i(x) = \ell(y_i, x^\top \xi_i) + (\mu/2)\|x\|^2$, where $(\xi_i, y_i)$ is an example-label pair, $\ell$ is a convex loss function, and $\mu$ is a regularization parameter.

In this paper, we are interested in a variant of (1) where random perturbations of data are introduced, which is a common scenario in machine learning. Then, the functions $f_i$ involve an expectation over a random perturbation $\rho$, leading to the problem

\[ \min_{x \in \mathbb{R}^p} \Big\{ F(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + h(x) \Big\} \quad \text{with} \quad f_i(x) = \mathbb{E}_\rho\big[\tilde{f}_i(x, \rho)\big]. \tag{2} \]

Unfortunately, variance reduction methods are not compatible with the setting (2), since evaluating a single gradient $\nabla f_i(x)$ requires computing a full expectation.
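To make this difficulty concrete, the following minimal sketch (our own toy construction with an assumed additive Gaussian perturbation; none of the names come from the paper) shows that in setting (2) the first-order oracle only returns gradients of $\tilde{f}_i(x, \rho)$ for sampled perturbations, so the exact gradient $\nabla f_i(x)$ can only be approximated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
Xi = rng.standard_normal((n, p))    # base examples xi_i (toy data, an assumption)
y = rng.standard_normal(n)
mu, alpha = 0.1, 0.3                # l2 regularization, perturbation std

def stochastic_grad(x, i):
    """First-order stochastic oracle: sample one perturbation rho (here,
    additive Gaussian noise on the features) and return grad f~_i(x, rho)."""
    xi = Xi[i] + alpha * rng.standard_normal(p)
    return (xi @ x - y[i]) * xi + mu * x

def grad_fi(x, i, n_mc=200_000):
    """grad f_i(x) = E_rho[grad f~_i(x, rho)] is an expectation: it can only
    be approximated by averaging many oracle calls, never evaluated exactly."""
    Z = Xi[i] + alpha * rng.standard_normal((n_mc, p))
    return ((Z @ x - y[i])[:, None] * Z).mean(axis=0) + mu * x
```

For this particular quadratic loss the expectation happens to have a closed form, $(\xi_i \xi_i^\top + \alpha^2 I)x - y_i \xi_i + \mu x$, which is convenient for checking the Monte Carlo estimate; for general losses and perturbations no such form exists, which is precisely why incremental methods cannot maintain exact per-example gradients here.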
Yet, dealing with random perturbations is of utmost interest; for instance, this is a key to achieving stable feature selection [23], improving the generalization error both in theory [33] and in practice [19, 32], obtaining stable and robust predictors [36], or using complex a priori knowledge about data to generate virtually larger datasets [19, 26, 30].

*Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Iteration complexity of different methods for solving the objective (2), in terms of the number of iterations required to find $x$ such that $\mathbb{E}[f(x) - f(x^*)] \leq \epsilon$. The complexity of N-SAGA [14] matches the first term of S-MISO but is asymptotically biased. Note that we always have the perturbation noise variance $\sigma_p^2$ smaller than the total variance $\sigma_{tot}^2$, and thus S-MISO improves on SGD both in the first term (linear convergence to a smaller $\bar\epsilon$) and in the second (smaller constant in the asymptotic rate). In many application cases, we also have $\sigma_p^2 \ll \sigma_{tot}^2$ (see main text and Table 2).

| Method | Asymptotic error | Iteration complexity |
| SGD | $0$ | $O\big(\frac{L}{\mu} \log \frac{1}{\bar\epsilon} + \frac{\sigma_{tot}^2}{\mu\epsilon}\big)$ with $\bar\epsilon = O\big(\frac{\sigma_{tot}^2}{\mu}\big)$ |
| N-SAGA [14] | $\epsilon_0 = O\big(\frac{\sigma_p^2}{\mu}\big)$ | $O\big(\big(n + \frac{L}{\mu}\big) \log \frac{1}{\epsilon}\big)$ with $\epsilon > \epsilon_0$ |
| S-MISO | $0$ | $O\big(\big(n + \frac{L}{\mu}\big) \log \frac{1}{\bar\epsilon} + \frac{\sigma_p^2}{\mu\epsilon}\big)$ with $\bar\epsilon = O\big(\frac{\sigma_p^2}{\mu}\big)$ |
Injecting noise in data is also useful for hiding gradient information in privacy-aware learning [10].

Despite its importance, the optimization problem (2) has been little studied, and to the best of our knowledge, no dedicated optimization method able to exploit the problem structure has been developed so far. A natural way to optimize this objective when $h = 0$ is indeed SGD, but ignoring the finite-sum structure leads to gradient estimates with high variance and slow convergence. The goal of this paper is to introduce an algorithm for strongly convex objectives, called stochastic MISO, which exploits the underlying finite sum using variance reduction. Our method achieves a faster convergence rate than SGD by removing the dependence on the gradient variance due to sampling the data points $i$ in $\{1, \ldots, n\}$; the dependence remains only for the variance due to random perturbations $\rho$.

To the best of our knowledge, our method is the first algorithm that interpolates naturally between incremental methods for finite sums (when there are no perturbations) and the stochastic approximation setting (when $n = 1$), while being able to efficiently tackle the hybrid case.

Related work. Many optimization methods dedicated to the finite-sum problem (e.g., [15, 29]) have been motivated by the fact that their updates can be interpreted as SGD steps with unbiased estimates of the full gradient, but with a variance that decreases as the algorithm approaches the optimum [15]; on the other hand, vanilla SGD requires decreasing step-sizes to achieve this reduction of variance, thereby slowing down convergence.
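This trade-off is easy to reproduce on a one-dimensional toy quadratic (a sketch with hypothetical constants, not an experiment from the paper): a constant step-size stalls at a noise floor proportional to the step-size times the gradient variance, while a decreasing $O(1/t)$ step-size converges to the optimum, but slowly.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 1.0   # f(x) = (mu/2) x^2, with additive gradient noise of std sigma

def run_sgd(step, T=2000, reps=500, x0=5.0):
    """Run `reps` independent SGD chains for T steps with step-size schedule step(t)."""
    x = np.full(reps, x0)
    for t in range(T):
        g = mu * x + sigma * rng.standard_normal(reps)  # unbiased noisy gradient
        x -= step(t) * g
    return x

# Constant step: fast initially, then stalls at a floor E[x^2] of order eta*sigma^2/(2*mu).
err_const = np.mean(run_sgd(lambda t: 0.1) ** 2)
# Decreasing step eta_t = 2/(mu*(t+10)): converges to the optimum, but at an O(1/t) rate.
err_decr = np.mean(run_sgd(lambda t: 2.0 / (mu * (t + 10))) ** 2)
```

With these constants, `err_const` settles around its noise floor (about 0.05 here) no matter how long the chains run, while `err_decr` keeps decreasing; variance reduction methods aim for linear convergence without paying this price.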
Our work aims at extending these techniques to the case where each function in the finite sum can only be accessed via a first-order stochastic oracle.

Most related to our work, recent methods that use data clustering to accelerate variance reduction techniques [3, 14] can be seen as tackling a special case of (2), where the expectations in $f_i$ are replaced by empirical averages over points in a cluster. While N-SAGA [14] was originally not designed for the stochastic context we consider, we remark that their method can be applied to (2). Their algorithm is however asymptotically biased and does not converge to the optimum. On the other hand, ClusterSVRG [3] is not biased, but does not support infinite datasets. The method proposed in [1] uses variance reduction in a setting where gradients are computed approximately, but the algorithm computes a full gradient at every pass, which is not available in our stochastic setting.

Paper organization. In Section 2, we present our algorithm for smooth objectives, and we analyze its convergence in Section 3. For space limitation reasons, we present an extension to composite objectives and non-uniform sampling in Appendix A. Section 4 is devoted to empirical results.

2 The Stochastic MISO Algorithm for Smooth Objectives

In this section, we introduce the stochastic MISO approach for smooth objectives ($h = 0$), which relies on the following assumptions:

• (A1) global strong convexity: $f$ is $\mu$-strongly convex;
• (A2) smoothness: $\tilde{f}_i(\cdot, \rho)$ is $L$-smooth for all $i$ and $\rho$ (i.e., with $L$-Lipschitz gradients).

Table 2: Estimated ratio $\sigma_{tot}^2/\sigma_p^2$, which corresponds to the expected acceleration of S-MISO over SGD. These numbers are based on feature-vector variance, which is closely related to the gradient variance when learning a linear model. ResNet-50 denotes a 50-layer network [12] pre-trained on the ImageNet dataset. For image transformations, the numbers are empirically evaluated from 100 different images, with 100 random perturbations for each image. $R_{tot}^2$ (respectively, $R_{cluster}^2$) denotes the average squared distance between pairs of points in the dataset (respectively, in a given cluster), following [14]. The settings for unsupervised CKN and Scattering are described in Section 4. More details are given in the main text.

| Type of perturbation | Application case | Estimated ratio $\sigma_{tot}^2/\sigma_p^2$ |
| Direct perturbation of linear model features | Data clustering as in [3, 14] | $R_{tot}^2/R_{cluster}^2$ |
| | Additive Gaussian noise $\mathcal{N}(0, \alpha^2 I)$ | $\approx 1 + 1/\alpha^2$ |
| | Dropout with probability $\delta$ | $\approx 1 + 1/\delta$ |
| | Feature rescaling by $s$ in $U(1-w, 1+w)$ | $\approx 1 + 3/w^2$ |
| Random image transformations | ResNet-50 [12], color perturbation | 21.9 |
| | ResNet-50 [12], rescaling + crop | 13.6 |
| | Unsupervised CKN [22], rescaling + crop | 9.6 |
| | Scattering [6], gamma correction | 9.8 |

Note that these assumptions are relaxed in Appendix A by supporting composite objectives and by exploiting different smoothness parameters $L_i$ on each example, a setting where non-uniform sampling of the training points is typically helpful to accelerate convergence (e.g., [35]).

Complexity results. We now introduce the following quantity, which is essential in our analysis:

\[ \sigma_p^2 := \frac{1}{n} \sum_{i=1}^n \sigma_i^2, \quad \text{with} \quad \sigma_i^2 := \mathbb{E}_\rho\big[\|\nabla \tilde{f}_i(x^*, \rho) - \nabla f_i(x^*)\|^2\big], \]

where $x^*$ is the (unique) minimizer of $f$. The quantity $\sigma_p^2$ represents the part of the variance of the gradients at the optimum that is due to the perturbations $\rho$.
In contrast, another quantity of interest is the total variance $\sigma_{tot}^2$, which also includes the randomness in the choice of the index $i$, defined as

\[ \sigma_{tot}^2 = \mathbb{E}_{i,\rho}\big[\|\nabla \tilde{f}_i(x^*, \rho)\|^2\big] = \sigma_p^2 + \mathbb{E}_i\big[\|\nabla f_i(x^*)\|^2\big] \quad \text{(note that } \nabla f(x^*) = 0\text{)}. \]

The relation between $\sigma_{tot}^2$ and $\sigma_p^2$ is obtained by simple algebraic manipulations.

The goal of our paper is to exploit the potential imbalance $\sigma_p^2 \ll \sigma_{tot}^2$, occurring when perturbations on input data are small compared to the sampling noise. The assumption is reasonable: given a data point, selecting a different one should lead to larger variation than a simple perturbation. From a theoretical point of view, the approach we propose achieves the iteration complexity presented in Table 1; see also Appendix D and [4, 5, 24] for the complexity analysis of SGD. The gain over SGD is of order $\sigma_{tot}^2/\sigma_p^2$, which is also observed in our experiments in Section 4. We also compare against the method N-SAGA; its convergence rate is similar to ours but suffers from a non-zero asymptotic error.

Motivation from application cases. One clear framework of application is the data clustering scenario already investigated in [3, 14]. Nevertheless, we will focus on less-studied data augmentation settings that lead instead to true stochastic formulations such as (2). First, we consider learning a linear model when adding simple direct manipulations of feature vectors, via rescaling (multiplying the feature vector by a random scalar), Dropout, or additive Gaussian noise, in order to improve the generalization error [33] or to get more stable estimators [23]. In Table 2, we present the potential gain over SGD in these scenarios. To do so, we study the variance of perturbations applied to a feature vector $\xi$.
Indeed, the gradient of the loss is proportional to $\xi$, which allows us to obtain good estimates of the ratio $\sigma_{tot}^2/\sigma_p^2$, as we observed in our empirical study of Dropout presented in Section 4. Whereas some perturbations are friendly to our method, such as feature rescaling (a rescaling window of $[0.9, 1.1]$ yields for instance a huge gain factor of 300), a large Dropout rate would lead to less impressive acceleration (e.g., Dropout with $\delta = 0.5$ simply yields a factor 2).

Second, we also consider more interesting domain-driven data perturbations such as classical image transformations considered in computer vision [26, 36], including image cropping, rescaling, brightness, contrast, hue, and saturation changes. These transformations may be used to train a linear classifier on top of an unsupervised multilayer image model such as unsupervised CKNs [22] or the scattering transform [6]. They may also be used for retraining the last layer of a pre-trained deep neural network: given a new task unseen during the full network training and a limited amount of training data, data augmentation may indeed be crucial to obtain good predictions, and S-MISO can help accelerate learning in this setting. These scenarios are also studied in Table 2, where the experiment with ResNet-50 involving random cropping and rescaling produces 224 × 224 images from 256 × 256 ones.

Algorithm 1 S-MISO for smooth objectives

Input: step-size sequence $(\alpha_t)_{t \geq 1}$; initialize $x_0 = \frac{1}{n} \sum_i z_i^0$ for some $(z_i^0)_{i=1,\ldots,n}$.
for $t = 1, \ldots$ do
  Sample an index $i_t$ uniformly at random, a perturbation $\rho_t$, and update

  \[ z_i^t = \begin{cases} (1 - \alpha_t)\, z_i^{t-1} + \alpha_t \big(x_{t-1} - \tfrac{1}{\mu} \nabla \tilde{f}_{i_t}(x_{t-1}, \rho_t)\big), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise;} \end{cases} \tag{3} \]

  \[ x_t = \frac{1}{n} \sum_{i=1}^n z_i^t = x_{t-1} + \frac{1}{n} \big(z_{i_t}^t - z_{i_t}^{t-1}\big). \tag{4} \]
end for
For these scenarios with realistic perturbations, the potential gain varies from 10 to 20.

Description of stochastic MISO. We are now in shape to present our method, described in Algorithm 1. Without perturbations and with a constant step-size, the algorithm resembles the MISO/Finito algorithms [9, 18, 21], which may be seen as primal variants of SDCA [28, 29]. Specifically, MISO is not able to deal with our stochastic objective (2), but it may address the deterministic finite-sum problem (1). It is part of a larger body of optimization methods that iteratively build a model of the objective function, typically a lower or upper bound on the objective that is easier to optimize; for instance, this strategy is commonly adopted in bundle methods [13, 25].

More precisely, MISO assumes that each $f_i$ is strongly convex and builds a model using lower bounds $D_t(x) = \frac{1}{n} \sum_{i=1}^n d_i^t(x)$, where each $d_i^t$ is a quadratic lower bound on $f_i$ of the form

\[ d_i^t(x) = c_{i,1}^t + \frac{\mu}{2}\|x - z_i^t\|^2 = c_{i,2}^t - \mu \langle x, z_i^t \rangle + \frac{\mu}{2}\|x\|^2. \tag{5} \]

These lower bounds are updated during the algorithm using strong convexity lower bounds at $x_{t-1}$ of the form $l_i^t(x) = f_i(x_{t-1}) + \langle \nabla f_i(x_{t-1}), x - x_{t-1} \rangle + \frac{\mu}{2}\|x - x_{t-1}\|^2 \leq f_i(x)$:

\[ d_i^t(x) = \begin{cases} (1 - \alpha_t)\, d_i^{t-1}(x) + \alpha_t\, l_i^t(x), & \text{if } i = i_t \\ d_i^{t-1}(x), & \text{otherwise,} \end{cases} \tag{6} \]

which corresponds to an update of the quantity $z_i^t$:

\[ z_i^t = \begin{cases} (1 - \alpha_t)\, z_i^{t-1} + \alpha_t \big(x_{t-1} - \tfrac{1}{\mu} \nabla f_{i_t}(x_{t-1})\big), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise.} \end{cases} \]

The next iterate is then computed as $x_t = \arg\min_x D_t(x)$, which is equivalent to (4).
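A compact reference implementation of Algorithm 1 may help fix ideas. The sketch below is our own (the toy least-squares oracle and all constants are assumptions, not from the paper); it maintains the vectors $z_i$ and their running average $x_t$ exactly as in updates (3) and (4).

```python
import numpy as np

rng = np.random.default_rng(2)

def s_miso(grad_oracle, n, p, mu, step, T):
    """Sketch of S-MISO (Algorithm 1, smooth case). grad_oracle(i, x) returns
    grad f~_i(x, rho) for a freshly sampled perturbation rho; step(t) is alpha_t.
    We initialize z_i^0 = 0, so x_0 = 0."""
    z = np.zeros((n, p))
    x = z.mean(axis=0)
    for t in range(1, T + 1):
        i = rng.integers(n)  # sample i_t uniformly at random
        z_new = (1 - step(t)) * z[i] + step(t) * (x - grad_oracle(i, x) / mu)  # (3)
        x = x + (z_new - z[i]) / n  # (4): keep x_t equal to the average of the z_i
        z[i] = z_new
    return x

# Toy instance: least squares with additive Gaussian feature noise (assumed setup).
n, p, mu, noise = 30, 5, 0.5, 0.1
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)

def grad_oracle(i, x):
    a = A[i] + noise * rng.standard_normal(p)  # perturbed example xi_i^rho
    return (a @ x - b[i]) * a + mu * x         # grad of 0.5*(a.x - b)^2 + (mu/2)|x|^2

# Step-sizes: constant phase satisfying condition (9), then a 2n/t decay,
# with a rough smoothness estimate L_est for this toy problem.
L_est = np.max(np.sum(A**2, axis=1)) + mu
alpha_max = min(0.5, n / (2 * (2 * L_est / mu - 1)))
x_hat = s_miso(grad_oracle, n, p, mu,
               step=lambda t: min(alpha_max, 2 * n / t), T=50_000)
```

For this quadratic toy problem the optimum is available in closed form, $x^* = (A^\top A/n + (\text{noise}^2 + \mu) I)^{-1} A^\top b / n$, so the final iterate can be checked against it; the storage cost is exactly one vector $z_i$ per example, as the paper discusses.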
The original MISO/Finito algorithms use $\alpha_t = 1$ under a "big data" condition on the sample size $n$ [9, 21], while the theory was later extended in [18] to relax this condition by supporting smaller constant steps $\alpha_t = \alpha$, leading to an algorithm that may be interpreted as a primal variant of SDCA (see [28]).

Note that when $f_i$ is an expectation, it is hard to obtain such lower bounds since the gradient $\nabla f_i(x_{t-1})$ is not available in general. For this reason, we have introduced S-MISO, which can exploit approximate lower bounds to each $f_i$ using gradient estimates, by letting the step-sizes $\alpha_t$ decrease appropriately, as commonly done in stochastic approximation. This leads to update (3).

Separately, SDCA [29] considers the Fenchel conjugates of $f_i$, defined by $f_i^*(y) = \sup_x x^\top y - f_i(x)$. When $f_i$ is an expectation, $f_i^*$ is not available in closed form in general, nor are its gradients, and in fact exploiting stochastic gradient estimates is difficult in the duality framework. In contrast, [28] gives an analysis of SDCA in the primal, a.k.a. "without duality", for smooth finite sums, and our work extends this line of reasoning to the stochastic approximation and composite settings.

Relationship with SGD in the smooth case.
The link between S-MISO in the non-composite setting and SGD can be seen by rewriting the update (4) as

\[ x_t = x_{t-1} + \frac{1}{n}\big(z_{i_t}^t - z_{i_t}^{t-1}\big) = x_{t-1} + \frac{\alpha_t}{n}\, v_t, \quad \text{where} \quad v_t := x_{t-1} - \frac{1}{\mu} \nabla \tilde{f}_{i_t}(x_{t-1}, \rho_t) - z_{i_t}^{t-1}. \tag{7} \]

Note that $\mathbb{E}[v_t \mid \mathcal{F}_{t-1}] = -\frac{1}{\mu} \nabla f(x_{t-1})$, where $\mathcal{F}_{t-1}$ contains all information up to iteration $t$; hence, the algorithm can be seen as an instance of the stochastic gradient method with unbiased gradients, which was a key motivation in SVRG [15] and later in other variance reduction algorithms [8, 28]. It is also worth noting that in the absence of a finite-sum structure ($n = 1$), we have $z_{i_t}^{t-1} = x_{t-1}$; hence our method becomes identical to SGD, up to a redefinition of step-sizes. In the composite case (see Appendix A), our approach yields a new algorithm that resembles regularized dual averaging [34].

Memory requirements and handling of sparse datasets. The algorithm requires storing the vectors $(z_i^t)_{i=1,\ldots,n}$, which takes the same amount of memory as the original dataset and is therefore a reasonable requirement in many practical cases. In the case of sparse datasets, it is fair to assume that random perturbations applied to input data preserve the sparsity patterns of the original vectors, as is the case, e.g., when applying Dropout to text documents described with bag-of-words representations [33].
If we further assume the typical setting where the $\mu$-strong convexity comes from an $\ell_2$ regularizer, $\tilde{f}_i(x, \rho) = \phi_i(x^\top \xi_i^\rho) + (\mu/2)\|x\|^2$, where $\xi_i^\rho$ is the (sparse) perturbed example and $\phi_i$ encodes the loss, then the update (3) can be written as

\[ z_i^t = \begin{cases} (1 - \alpha_t)\, z_i^{t-1} - \frac{\alpha_t}{\mu}\, \phi_i'(x_{t-1}^\top \xi_i^{\rho_t})\, \xi_i^{\rho_t}, & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise,} \end{cases} \]

which shows that for every index $i$, the vector $z_i^t$ preserves the same sparsity pattern as the examples $\xi_i^\rho$ throughout the algorithm (assuming the initialization $z_i^0 = 0$), making the update (3) efficient. The update (4) has the same cost, since the difference $z_{i_t}^t - z_{i_t}^{t-1}$ is also sparse.

Limitations and alternative approaches. Since our algorithm is uniformly better than SGD in terms of iteration complexity, its main limitation is in terms of memory storage when the dataset cannot fit into memory (remember that the memory cost of S-MISO is the same as that of the input dataset). In these huge-scale settings, SGD should be preferred; this in fact holds true for all incremental methods when one cannot afford to perform more than one (or very few) passes over the data. Our paper focuses instead on non-huge datasets, which are those benefiting most from data augmentation.

We note that a different approach to variance reduction like SVRG [15] is able to trade off storage requirements for additional full gradient computations, which would be desirable in some situations. However, we were not able to obtain any decreasing step-size strategy that works for these methods, in theory or in practice, leaving us with constant step-size approaches as in [1, 14] that either maintain a non-zero asymptotic error or require dynamically reducing the variance of gradient estimates.
One possible way to explain this difficulty is that SVRG and SAGA [8] "forget" past gradients for a given example $i$, while S-MISO averages them in (3), which seems to be a technical key to making it suitable for stochastic approximation. Nevertheless, the question of whether it is possible to trade off storage with computation in a setting like ours is open and of utmost interest.

3 Convergence Analysis of S-MISO

We now study the convergence properties of the S-MISO algorithm. For space limitation reasons, all proofs are provided in Appendix B. We start by defining the problem-dependent quantities $z_i^* := x^* - \frac{1}{\mu} \nabla f_i(x^*)$, and then introduce the Lyapunov function

\[ C_t = \frac{1}{2}\|x_t - x^*\|^2 + \frac{\alpha_t}{n^2} \sum_{i=1}^n \|z_i^t - z_i^*\|^2. \tag{8} \]

Proposition 1 gives a recursion on $C_t$, obtained by upper-bounding its two terms separately and finding coefficients that cancel out the other quantities appearing when relating $C_t$ to $C_{t-1}$. To this end, we borrow elements of the convergence proof of SDCA without duality [28]; our technical contribution is to extend their result to the stochastic approximation and composite (see Appendix A) cases.

Proposition 1 (Recursion on $C_t$). If $(\alpha_t)_{t \geq 1}$ is a positive and non-increasing sequence satisfying

\[ \alpha_1 \leq \min\Big\{\frac{1}{2}, \frac{n}{2(2\kappa - 1)}\Big\}, \tag{9} \]

with $\kappa = L/\mu$, then $C_t$ obeys the recursion

\[ \mathbb{E}[C_t] \leq \Big(1 - \frac{\alpha_t}{n}\Big)\, \mathbb{E}[C_{t-1}] + 2\Big(\frac{\alpha_t}{n}\Big)^2 \frac{\sigma_p^2}{\mu^2}. \tag{10} \]

We now state the main convergence result, which provides the expected rate $O(1/t)$ on $C_t$ based on decreasing step-sizes, similar to [5] for SGD.
Note that convergence of objective function values is directly related to that of the Lyapunov function $C_t$ via smoothness:

\[ \mathbb{E}[f(x_t) - f(x^*)] \leq \frac{L}{2}\, \mathbb{E}\big[\|x_t - x^*\|^2\big] \leq L\, \mathbb{E}[C_t]. \tag{11} \]

Theorem 2 (Convergence of Lyapunov function). Let the sequence of step-sizes $(\alpha_t)_{t \geq 1}$ be defined by $\alpha_t = \frac{2n}{\gamma + t}$ with $\gamma \geq 0$ such that $\alpha_1$ satisfies (9). For all $t \geq 0$, it holds that

\[ \mathbb{E}[C_t] \leq \frac{\nu}{\gamma + t + 1}, \quad \text{where} \quad \nu := \max\Big\{\frac{8\sigma_p^2}{\mu^2},\, (\gamma + 1)\, C_0\Big\}. \tag{12} \]

Choice of step-sizes in practice. Naturally, we would like $\nu$ to be small, in particular independent of the initial condition $C_0$ and equal to the first term in the definition (12). We would like the dependence on $C_0$ to vanish at a faster rate than $O(1/t)$, as is the case in variance reduction algorithms on finite sums. As advised in [5] in the context of SGD, we can initially run the algorithm with a constant step-size $\bar\alpha$ and exploit this linear convergence regime until we reach the level of noise given by $\sigma_p$, and then start decaying the step-size. It is easy to see that with a constant step-size $\bar\alpha$, $C_t$ converges near a value $\bar{C} := 2\bar\alpha \sigma_p^2 / (n\mu^2)$. Indeed, Eq. (10) with $\alpha_t = \bar\alpha$ yields

\[ \mathbb{E}[C_t - \bar{C}] \leq \Big(1 - \frac{\bar\alpha}{n}\Big)\, \mathbb{E}[C_{t-1} - \bar{C}]. \]

Thus, we can reach a precision $C_0'$ with $\mathbb{E}[C_0'] \leq \bar\epsilon := 2\bar{C}$ in $O\big(\frac{n}{\bar\alpha} \log \frac{C_0}{\bar\epsilon}\big)$ iterations.
Then, if we start decaying step-sizes as in Theorem 2 with $\gamma$ large enough so that $\alpha_1 = \bar\alpha$, we have

\[ (\gamma + 1)\, \mathbb{E}[C_0'] \leq (\gamma + 1)\,\bar\epsilon = \frac{8\sigma_p^2}{\mu^2}, \]

making both terms in (12) smaller than or equal to $\nu = 8\sigma_p^2/\mu^2$. Considering these two phases, with an initial step-size $\bar\alpha$ given by (9), the final work complexity for reaching $\mathbb{E}[\|x_t - x^*\|^2] \leq \epsilon$ is

\[ O\Big(\Big(n + \frac{L}{\mu}\Big) \log \frac{C_0}{\bar\epsilon}\Big) + O\Big(\frac{\sigma_p^2}{\mu^2 \epsilon}\Big). \tag{13} \]

We can then use (11) in order to obtain the complexity for reaching $\mathbb{E}[f(x_t) - f(x^*)] \leq \epsilon$. Note that following this step-size strategy was found to be very effective in practice (see Section 4).

Acceleration by iterate averaging. When one is interested in the convergence in function values, the complexity (13) combined with (11) yields $O(L\sigma_p^2/\mu^2\epsilon)$, which can be problematic for ill-conditioned problems (large condition number $L/\mu$). The following theorem presents an iterate averaging scheme which brings this complexity term down to $O(\sigma_p^2/\mu\epsilon)$, as appeared in Table 1.

Theorem 3 (Convergence under iterate averaging). Let the step-size sequence $(\alpha_t)_{t \geq 1}$ be defined by

\[ \alpha_t = \frac{2n}{\gamma + t} \quad \text{for } \gamma \geq 1 \text{ s.t. } \alpha_1 \leq \min\Big\{\frac{1}{2}, \frac{n}{4(2\kappa - 1)}\Big\}. \]

We have

\[ \mathbb{E}[f(\bar{x}_T) - f(x^*)] \leq \frac{2\mu\gamma(\gamma - 1)\, C_0}{T(2\gamma + T - 1)} + \frac{16\sigma_p^2}{\mu(2\gamma + T - 1)}, \quad \text{where} \quad \bar{x}_T := \frac{2}{T(2\gamma + T - 1)} \sum_{t=0}^{T-1} (\gamma + t)\, x_t. \]

Figure 1: Impact of conditioning for data augmentation on STL-10 (controlled by $\mu$, where $\mu = 10^{-4}$ gives the best accuracy).
Values of the loss are shown on a logarithmic scale (1 unit = a factor of 10). $\eta = 0.1$ satisfies the theory for all methods, and we include curves for the larger step-size $\eta = 1$. We omit N-SAGA for $\eta = 1$ because it remains far from the optimum. For the scattering representation, the problem we study is $\ell_1$-regularized, and we use the composite algorithm of Appendix A.

Figure 2: Re-training of the last layer of a pre-trained ResNet-50 model, on a small dataset with random color perturbations (for different values of $\mu$).

The proof uses a telescoping sum technique similar to [16]. Note that if $T \gg \gamma$, the first term, which depends on the initial condition $C_0$, decays as $1/T^2$ and is thus dominated by the second term. Moreover, if we start averaging after an initial phase with constant step-size $\bar\alpha$, we can consider $C_0 \approx 4\bar\alpha \sigma_p^2 / (n\mu^2)$. In the ill-conditioned regime, taking $\bar\alpha = \alpha_1 = 2n/(\gamma + 1)$ as large as allowed by (9), we have $\gamma$ of the order of $\kappa = L/\mu \gg 1$. The full convergence rate then becomes

\[ \mathbb{E}[f(\bar{x}_T) - f(x^*)] \leq O\Big(\frac{\sigma_p^2}{\mu(\gamma + T)} \Big(1 + \frac{\gamma}{T}\Big)\Big). \]

When $T$ is large enough compared to $\gamma$, this becomes $O(\sigma_p^2/\mu T)$, leading to a complexity of $O(\sigma_p^2/\mu\epsilon)$.

4 Experiments

We present experiments comparing S-MISO with SGD and N-SAGA [14] on four different scenarios, in order to demonstrate the wide applicability of our method: we consider an image classification dataset with two different image representations and random transformations, and two classification tasks with Dropout regularization, one on genetic data and one on (sparse) text data. Figures 1 and 3
Figures 1 and 3\nshow the curves for an estimate of the training objective using 5 sampled perturbations per example.\nThe plots are shown on a logarithmic scale, and the values are compared to the best value obtained\namong the different methods in 500 epochs. The strong convexity constant \u00b5 is the regularization\nparameter. For all methods, we consider step-sizes supported by the theory as well as larger step-sizes\nthat may work better in practice. Our C++/Cython implementation of all methods considered in this\nsection is available at https://github.com/albietz/stochs.\n\nChoices of step-sizes. For both S-MISO and SGD, we use the step-size strategy mentioned in\nSection 3 and advised by [5], which we have found to be most effective among many heuristics\n\n7\n\n050100150200250300350400450epochs10-510-410-310-210-1100f - f*STL-10 ckn, \u00b5=10\u22123S-MISO \u03b7=0.1S-MISO \u03b7=1.0N-SAGA \u03b7=0.1SGD \u03b7=0.1SGD \u03b7=1.0050100150200250300350400450epochs10-410-310-210-1100f - f*STL-10 ckn, \u00b5=10\u22124050100150200250300350400epochs10-310-210-1100f - f*STL-10 ckn, \u00b5=10\u22125050100150200250300350400epochs10-510-410-310-210-1100F - F*STL-10 scattering, \u00b5=10\u22123050100150200250300350400epochs10-510-410-310-210-1100101F - F*STL-10 scattering, \u00b5=10\u22124050100150200250300350400epochs10-410-310-210-1100101F - F*STL-10 scattering, \u00b5=10\u22125050100150200250300350400epochs10-610-510-410-310-2f - f*ResNet50, \u00b5=10\u22122S-MISO \u03b7=0.1S-MISO \u03b7=1.0N-SAGA \u03b7=0.1SGD \u03b7=0.1SGD \u03b7=1.0050100150200250300350400epochs10-710-610-510-410-310-210-1100f - f*ResNet50, \u00b5=10\u22123050100150200250300350400epochs10-510-410-310-210-1100f - f*ResNet50, \u00b5=10\u22124\fFigure 3: Impact of perturbations controlled by the Dropout rate \u03b4. The gene data is \u21132-normalized;\nhence, we consider similar step-sizes as Figure 1. 
The IMDB dataset is highly heterogeneous; thus, we also include the non-uniform (NU) sampling variants of Appendix A. For uniform sampling, theoretical step-sizes perform poorly for all methods; thus, we show a larger tuned step-size $\eta = 10$.

we have tried: we initially keep the step-size constant (controlled by a factor $\eta \leq 1$ in the figures) for 2 epochs, and then start decaying it as $\alpha_t = C/(\gamma + t)$, where $C = 2n$ for S-MISO, $C = 2/\mu$ for SGD, and $\gamma$ is chosen large enough to match the previous constant step-size. For N-SAGA, we maintain a constant step-size throughout the optimization, as suggested in the original paper [14]. The factor $\eta$ shown in the figures is such that $\eta = 1$ corresponds to an initial step-size $n\mu/(L - \mu)$ for S-MISO (from (19) in the uniform case) and $1/L$ for SGD and N-SAGA (with $\bar{L}$ instead of $L$ in the non-uniform case when using the variant of Appendix A).

Image classification with "data augmentation". The success of deep neural networks is often limited by the availability of large amounts of labeled images. When there are many unlabeled images but few labeled ones, a common approach is to train a linear classifier on top of a deep network learned in an unsupervised manner, or pre-trained on a different task (e.g., on the ImageNet dataset). We follow this approach on the STL-10 dataset [7], which contains 5K training images from 10 classes and 100K unlabeled images, using a 2-layer unsupervised convolutional kernel network [22], giving representations of dimension 9,216. The perturbation consists of randomly cropping and scaling the input images. We use the squared hinge loss in a one-versus-all setting. The vector representations are $\ell_2$-normalized, so that we may use the upper bound $L = 1 + \mu$ for the smoothness constant.
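The two-phase schedule described above can be written in a few lines. The sketch below uses our own helper names and illustrative placeholder constants for $n$, $\mu$, and $L$, with $\gamma$ set so that the decaying phase starts exactly at the previous constant value.

```python
def make_step_size(eta0, C, iters_per_epoch, epochs_const=2):
    """Step-size schedule sketch: constant eta0 for `epochs_const` epochs, then
    alpha_t = C/(gamma + t), with gamma chosen so that the two phases match
    (C/(gamma + t0) = eta0 at the switch point t0)."""
    t0 = epochs_const * iters_per_epoch
    gamma = C / eta0 - t0  # continuity at the switch
    return lambda t: eta0 if t <= t0 else C / (gamma + t)

# Illustrative constants: C = 2n for S-MISO and C = 2/mu for SGD, with the
# initial step-sizes corresponding to eta = 1 in the figures.
n, mu, L = 1000, 1e-4, 1.0
smiso_step = make_step_size(eta0=n * mu / (L - mu), C=2 * n, iters_per_epoch=n)
sgd_step = make_step_size(eta0=1.0 / L, C=2.0 / mu, iters_per_epoch=n)
```

Matching $\gamma$ to the constant phase keeps the schedule non-increasing and avoids the abrupt drop that a naive $C/t$ decay would cause right after the warm-up epochs.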
We also present results on the same dataset using a scattering representation [6]\nof dimension 21 696, with random gamma corrections (raising all pixels to the power \u03b3, where \u03b3 is\nchosen randomly around 1). For this representation, we add an \u21131 regularization term and use the\ncomposite variant of S-MISO presented in Appendix A.\n\nFigure 1 shows convergence results on one training fold (500 images), for different values of \u00b5,\nallowing us to study the behavior of the algorithms for different condition numbers. The low variance\ninduced by data transformations allows S-MISO to reach suboptimality that is orders of magnitude\nsmaller than SGD after the same number of epochs. Note that one unit on these plots corresponds to\none order of magnitude in the logarithmic scale. N-SAGA initially reaches a smaller suboptimality\nthan SGD, but quickly gets stuck due to the bias in the algorithm, as predicted by the theory [14],\nwhile S-MISO and SGD continue to converge to the optimum thanks to the decreasing step-sizes. The\nbest validation accuracy for both representations is obtained for \u00b5 \u2248 10\u22124 (middle column), and we\nobserved relative gains of up to 1% from using data augmentation. We computed empirical variances\nof the image representations for these two strategies, which are closely related to the variance in\ngradient estimates, and observed these transformations to account for about 10% of the total variance.\n\nFigure 2 shows convergence results when training the last layer of a 50-layer Residual network [12]\nthat has been pre-trained on ImageNet. Here, we consider the common scenario of leveraging a deep\nmodel trained on a large dataset as a feature extractor in order to learn a new classi\ufb01er on a different\nsmall dataset, where it would be dif\ufb01cult to train such a model from scratch. 
To simulate this setting, we consider a binary classification task on a small dataset of 100 images of size 256×256 taken from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, which we crop to 224×224 before performing random adjustments to brightness, saturation, hue and contrast. As in the STL-10 experiments, the gains of S-MISO over other methods are of about one order of magnitude in suboptimality, as predicted by Table 2.

[Figure 3 plots: suboptimality vs. epochs (log scale) for S-MISO, SGD, and N-SAGA, with gene dropout (top) and IMDB dropout (bottom, including NU variants), for δ ∈ {0.30, 0.10, 0.01}.]

Dropout on gene expression data. We trained a binary logistic regression model on the breast cancer dataset of [31], with different Dropout rates δ, i.e., where at every iteration, each coordinate ξj of a feature vector ξ is set to zero independently with probability δ and to ξj/(1 − δ) otherwise. The dataset consists of 295 vectors of dimension 8 141 of gene expression data, which we normalize in ℓ2 norm. Figure 3 (top) compares S-MISO with SGD and N-SAGA for three values of δ, as a way to control the variance of the perturbations. We include a Dropout rate of 0.01 to illustrate the impact of δ on the algorithms and study the influence of the perturbation variance σp², even though this value of δ is less relevant for the task.
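The Dropout perturbation just described (each coordinate zeroed with probability δ, survivors rescaled by 1/(1 − δ), so that the perturbed vector equals ξ in expectation) can be sketched as follows; the helper name is ours, and pure Python is used for clarity:

```python
import random

def dropout_perturb(xi, delta, rng=random):
    """Return a perturbed copy of the feature vector xi: each coordinate
    is set to 0 with probability delta, and to xi_j / (1 - delta)
    otherwise.  The rescaling keeps the perturbation unbiased:
    E[output] = xi."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    scale = 1.0 / (1.0 - delta)
    return [0.0 if rng.random() < delta else x * scale for x in xi]
```

With δ = 0 the vector is returned unchanged, and averaging many perturbed copies of ξ recovers ξ, which is the property the stochastic-gradient analysis relies on.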
The plots show very clearly how the variance induced by the perturbations affects the convergence of S-MISO, giving suboptimality values that may be orders of magnitude smaller than those of SGD. This behavior is consistent with the theoretical convergence rate established in Section 3 and shows that the practice matches the theory.

Dropout on movie review sentiment analysis data. We trained a binary classifier with a squared hinge loss on the IMDB dataset [20] with different Dropout rates δ. We use the labeled part of the IMDB dataset, which consists of 25K training and 25K testing movie reviews, represented as 89 527-dimensional sparse bag-of-words vectors. In contrast to the previous experiments, we do not normalize the representations, which have great variability in their norms; in particular, the maximum Lipschitz constant across training points is roughly 100 times larger than the average one. Figure 3 (bottom) compares non-uniform sampling versions of S-MISO (see Appendix A) and SGD (see Appendix D) with their uniform sampling counterparts as well as N-SAGA. Note that we use a large step-size η = 10 for the uniform sampling algorithms, since η = 1 was significantly slower for all methods, likely due to outliers in the dataset. In contrast, the non-uniform sampling algorithms required no tuning and simply use η = 1. The curves clearly show that S-MISO-NU converges much faster in the initial phase, thanks to the larger step-size allowed by non-uniform sampling, and later converges similarly to S-MISO, i.e., at a much faster rate than SGD when the perturbations are small.
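Non-uniform sampling of this kind is typically implemented by drawing example i with probability qi proportional to its smoothness constant Li, and rescaling the resulting gradient estimate by 1/(n qi) to keep it unbiased. The following is a generic importance-sampling sketch under that assumption, not necessarily the exact scheme of Appendix A:

```python
import random

def make_lipschitz_sampler(lipschitz):
    """Importance sampling of example indices: draw i with probability
    q_i = L_i / sum_j L_j, and also return the weight 1 / (n * q_i)
    by which a gradient estimate would be rescaled to stay unbiased."""
    n = len(lipschitz)
    total = sum(lipschitz)
    probs = [L / total for L in lipschitz]

    def sample(rng=random):
        i = rng.choices(range(n), weights=probs, k=1)[0]
        return i, 1.0 / (n * probs[i])

    return sample
```

Examples with large Li are visited more often but with smaller weights, which is what permits the larger initial step-size (depending on the average constant L̄ rather than the maximum L) observed for S-MISO-NU above.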
The value of µ used in the experiments was chosen by cross-validation, and the use of Dropout gave improvements in test accuracy from 88.51% with no dropout to 88.68 ± 0.03% with δ = 0.1 and 88.86 ± 0.11% with δ = 0.3 (based on 10 different runs of S-MISO-NU after 400 epochs).

Finally, we also study the effect of the iterate averaging scheme of Theorem 3 in Appendix E.

Acknowledgements

This work was supported by a grant from ANR (MACARON project under grant number ANR-14-CE23-0003-01), by the ERC grant number 714381 (SOLARIS project), and by the MSR-Inria joint center.

References

[1] M. Achab, A. Guilloux, S. Gaïffas, and E. Bacry. SGD with Variance Reduction beyond Empirical Risk Minimization. arXiv:1510.04822, 2015.

[2] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Symposium on the Theory of Computing (STOC), 2017.

[3] Z. Allen-Zhu, Y. Yuan, and K. Sridharan. Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters. In Advances in Neural Information Processing Systems (NIPS), 2016.

[4] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), 2011.

[5] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838, 2016.

[6] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1872–1886, 2013.

[7] A. Coates, H. Lee, and A. Y. Ng. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[8] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.
In Advances in Neural Information Processing Systems (NIPS), 2014.

[9] A. Defazio, J. Domke, and T. S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning (ICML), 2014.

[10] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. In Advances in Neural Information Processing Systems (NIPS), 2012.

[11] J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR), 10:2899–2934, 2009.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[13] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms I: Fundamentals. Springer Science & Business Media, 1993.

[14] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance Reduced Stochastic Gradient Descent with Neighbors. In Advances in Neural Information Processing Systems (NIPS), 2015.

[15] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), 2013.

[16] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv:1212.2002, 2012.

[17] G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 2017.

[18] H. Lin, J. Mairal, and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.

[19] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.

[20] A.
L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In The 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 142–150. Association for Computational Linguistics, 2011.

[21] J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

[22] J. Mairal. End-to-End Kernel Learning with Supervised Convolutional Kernel Networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[23] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.

[24] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[25] Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.

[26] M. Paulin, J. Revaud, Z. Harchaoui, F. Perronnin, and C. Schmid. Transformation pursuit for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[27] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017.

[28] S. Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In International Conference on Machine Learning (ICML), 2016.

[29] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research (JMLR), 14:567–599, 2013.

[30] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation Invariance in Pattern Recognition — Tangent Distance and Tangent Propagation. In G. B. Orr and K.-R.
M\u00fcller, editors, Neural Networks:\nTricks of the Trade, number 1524 in Lecture Notes in Computer Science, pages 239\u2013274. Springer Berlin\nHeidelberg, 1998.\n\n[31] M. J. van de Vijver et al. A Gene-Expression Signature as a Predictor of Survival in Breast Cancer. New\n\nEngland Journal of Medicine, 347(25):1999\u20132009, Dec. 2002.\n\n[32] L. van der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized corrupted\n\nfeatures. In International Conference on Machine Learning (ICML), 2013.\n\n[33] S. Wager, W. Fithian, S. Wang, and P. Liang. Altitude Training: Strong Bounds for Single-layer Dropout.\n\nIn Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[34] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of\n\nMachine Learning Research (JMLR), 11:2543\u20132596, 2010.\n\n[35] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM\n\nJournal on Optimization, 24(4):2057\u20132075, 2014.\n\n[36] S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via\nstability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2016.\n\n11\n\n\f", "award": [], "sourceid": 1025, "authors": [{"given_name": "Alberto", "family_name": "Bietti", "institution": "Inria"}, {"given_name": "Julien", "family_name": "Mairal", "institution": "Inria"}]}