{"title": "Momentum-Based Variance Reduction in Non-Convex SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 15236, "page_last": 15245, "abstract": "Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the converge rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large \"mega-batches\" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\\mathbb{E}[\\|\\nabla F(x)\\|]\\le O(1/\\sqrt{T}+\\sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\\sigma^2$ variance in the gradients, matching the best-known rate but without requiring knowledge of $\\sigma$.", "full_text": "Momentum-BasedVarianceReductioninNon-ConvexSGDAshokCutkoskyGoogleResearchMountainView,CA,USAashok@cutkosky.comFrancescoOrabonaBostonUniversityBoston,MA,USAfrancesco@orabona.comAbstractVariancereductionhasemergedinrecentyearsasastrongcompetitortostochasticgradientdescentinnon-convexproblems,providingthe\ufb01rstalgorithmstoimproveupontheconvergerateofstochasticgradientdescentfor\ufb01nding\ufb01rst-ordercriticalpoints.However,variancereductiontechniquestypicallyrequirecarefullytunedlearningratesandwillingnesstouseexcessivelylarge\u201cmega-batches\u201dinordertoachievetheirimprovedresults.Wepresentanewalgorithm,STORM,thatdoesnotrequireanybatchesandmakesuseofadaptivelearningrates,enablingsimplerimplementationandlesshyperparametertuning.Ourtechniqueforremovingthebatchesusesavariantofmomentumtoachievevariancereductioninnon-convexoptimization.OnsmoothlossesF,STORM\ufb01ndsapointxwithE[k\u2207F(x)k]\u2264O(1/\u221aT+\u03c31/3/T1/3)inTiterationswith\u03c32varianceinthegradients,matchingtheoptimalrateandwithoutrequiringknowledgeof\u03c3.1IntroductionThispaperaddressestheclassicstochasticoptimizationproblem,inwhichwearegivenafunctionF:Rd\u2192R,andwishto\ufb01ndx\u2208RdsuchthatF(x)isassmallaspossible.Unfortunately,ouraccesstoFislimitedtoastochasticfunctionoracle:wecanobtainsamplefunctionsf(\u00b7,\u03be)where\u03berepresentssomesamplevariable(e.g.aminibatchindex)suchthatE[f(\u00b7,\u03be)]=F(\u00b7).Stochasticoptimizationproblemsarefoundthroughoutmachinelearning.Forexample,insupervisedlearning,xrepresentstheparametersofamodel(saytheweightsofaneuralnetwork),\u03berepresentsanexample,f(x,\u03be)representsthelossonanexample,andFrepresentsthetraininglossofthemodel.Wedonotassumeconvexity,soingeneraltheproblemof\ufb01ndingatrueminimumofFmaybeNP-hard.Hence,werelaxtheproblemto\ufb01ndingacriticalpointofF\u2013thatisapointsuchthat\u2207F(x)=0.Also,weassumeaccessonlytostochasticgradientsevaluatedonarbitrarypoints,ratherthanHessiansorotherinformation.Inthissetting,thestandardalgorithmisstochasticgradientdescent(SGD).SGDproducesasequenceofiteratesx1,...,xTusingtherecursionxt+1=xt\u2212\u03b7tgt,(1)wheregt=\u2207f(xt,\u03bet),f(\u00b7,\u03be1),...,f(\u00b7,\u03beT)arei.i.d.samplesfromadistributionD,and\u03b71,...\u03b7T\u2208Rareasequenceoflearningratesthatmustbecarefullytunedtoensuregoodperfor-mance.Assumingthe\u03b7taresel
Recently, variance reduction has emerged as an improved technique for finding critical points in non-convex optimization problems. Stochastic variance-reduced gradient (SVRG) algorithms also produce iterates $x_1, \dots, x_T$ according to the update formula (1), but now $g_t$ is a variance-reduced estimate of $\nabla F(x_t)$. Over the last few years, SVRG algorithms have improved the convergence rate to critical points of non-convex SGD from $O(1/T^{1/4})$ to $O(1/T^{3/10})$ [2, 21] to $O(1/T^{1/3})$ [8, 31]. Despite this improvement, SVRG has not seen as much success in practice in non-convex machine learning problems [5]. Many reasons may contribute to this phenomenon, but two potential issues we address here are SVRG's use of non-adaptive learning rates and reliance on giant batch sizes to construct variance-reduced gradients through the use of low-noise gradients calculated at a "checkpoint". In particular, for non-convex losses SVRG analyses typically involve carefully selecting the learning rates, the number of samples used to construct the gradient at the checkpoint, and the frequency with which the checkpoint is updated. The optimal settings balance various unknown problem parameters exactly in order to obtain improved performance, making it especially important, and especially difficult, to tune them.

In this paper, we address both of these issues. We present a new algorithm called STOchastic Recursive Momentum (STORM) that achieves variance reduction through the use of a variant of the momentum term, similar to the popular RMSProp or Adam momentum heuristics [24, 13]. Hence, our algorithm does not require a gigantic batch to compute checkpoint gradients; in fact, our algorithm does not require any batches at all because it never needs to compute a checkpoint gradient. STORM achieves the optimal convergence rate of $O(1/T^{1/3})$ [3], and it uses an adaptive learning rate schedule that will automatically adjust to the variance values of $\nabla f(x_t, \xi_t)$. Overall, we consider our algorithm a significant qualitative departure from the usual paradigm for variance reduction, and we hope our analysis may provide insight into the value of momentum in non-convex optimization.

The rest of the paper is organized as follows. The next section discusses the related work on variance reduction and adaptive learning rates in non-convex SGD. Section 3 formally introduces our notation and assumptions. We present our basic update rule and its connection to SGD with momentum in Section 4, and our algorithm in Section 5. Finally, we present some empirical results in Section 6 and conclude with a discussion in Section 7.

2 Related Work

Variance-reduction methods were proposed independently by three groups at the same conference: Johnson and Zhang [12], Zhang et al. [30], Mahdavi et al. [17], and Wang et al. [27]. The first application of variance-reduction methods to non-convex SGD is due to Allen-Zhu and Hazan [2]. Using variance-reduction methods, Fang et al. [8] and Zhou et al. [31] have obtained much better convergence rates for critical points in non-convex SGD. These methods are very different from our approach because they require the calculation of gradients at checkpoints. In fact, in order to compute the variance-reduced gradient estimates $g_t$, the algorithm must periodically stop producing iterates $x_t$ and instead generate a very large "mega-batch" of samples $\xi_1, \dots, \xi_N$, which is used to compute a checkpoint gradient $\frac{1}{N}\sum_{i=1}^N \nabla f(v, \xi_i)$ at an appropriate checkpoint $v$. Depending on the algorithm, $N$ may be as large as $O(T)$, and typically no smaller than $O(T^{2/3})$. The only exceptions we are aware of are SARAH [18, 19] and iSARAH [20]. However, their guarantees do not improve over the ones of plain SGD, and they still require at least one checkpoint gradient. Independently and simultaneously with this work, [25] have proposed a new algorithm that does improve over SGD to match our convergence rate, although it does still require one checkpoint gradient. Interestingly, their update formula is very similar to ours, although the analysis is rather different. We are not aware of prior works for non-convex optimization with reduced-variance methods that completely avoid using giant batches.

On the other hand, adaptive learning-rate schemes, which choose the values $\eta_t$ in some data-dependent way so as to reduce the need for tuning them manually, were introduced by Duchi et al. [7] and popularized by heuristic methods like RMSProp and Adam [24, 13]. In the non-convex setting, adaptive learning rates can be shown to improve the convergence rate of SGD to $O(1/\sqrt{T} + (\sigma^2/T)^{1/4})$, where $\sigma^2$ is a bound on the variance of $\nabla f(x_t)$ [16, 28, 22]. Hence, these adaptive algorithms obtain much better convergence guarantees when the problem is "easy", and have become extremely popular in practice. In contrast, the only variance-reduced algorithm we are aware of that uses adaptive learning rates is [4], but their techniques apply only to convex losses.

3 Notation and Assumptions

In the following, we will write vectors with bold letters and we will denote the inner product between vectors $a$ and $b$ by $a \cdot b$.

Throughout the paper we will make the following assumptions. We assume access to a stream of independent random variables $\xi_1, \dots, \xi_T \in \Xi$ and a function $f$ such that for all $t$ and for all $x$, $\mathbb{E}[f(x, \xi_t) \mid x] = F(x)$. Note that we access two gradients on the same $\xi_t$ at two different points in each update, as in standard variance-reduced methods. In practice, $\xi_t$ may denote an i.i.d. training example, or an index into a training set, while $f(x, \xi_t)$ indicates the loss on that training example using the model parameters $x$. We assume there is some $\sigma^2$ that upper bounds the noise on the gradients: $\mathbb{E}[\|\nabla f(x, \xi_t) - \nabla F(x)\|^2] \le \sigma^2$. We define $F^\star = \inf_x F(x)$ and we will assume that $F^\star > -\infty$.

We will also need some assumptions on the functions $f(x, \xi_t)$. Define a differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ to be $G$-Lipschitz iff $\|\nabla f(x)\| \le G$ for all $x$, and $f$ to be $L$-smooth iff $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x$ and $y$. We assume that $f(x, \xi_t)$ is differentiable and $L$-smooth as a function of $x$ with probability 1. We will also assume that $f(x, \xi_t)$ is $G$-Lipschitz for our adaptive analysis. We show in Appendix B that this assumption can be lifted at the expense of adaptivity to $\sigma$.
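To make the oracle model concrete, the sketch below instantiates $f(x,\xi)$ for a least-squares loss and shows the access pattern the analysis relies on: two gradients evaluated on the same sample $\xi_t$ at two different points. The example problem, function names, and data shapes are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xi(A, b, batch_size=32):
    """Draw xi_t: here, a random minibatch of rows of a least-squares problem."""
    idx = rng.integers(0, A.shape[0], size=batch_size)
    return A[idx], b[idx]

def grad_f(x, xi):
    """Gradient of f(x, xi) = 0.5/|batch| * ||A_xi x - b_xi||^2, an unbiased estimate of grad F(x)."""
    A_xi, b_xi = xi
    return A_xi.T @ (A_xi @ x - b_xi) / len(b_xi)

# The key access pattern: the SAME sample xi_t is used at two different points.
A = rng.standard_normal((1000, 20)); b = rng.standard_normal(1000)
x_prev, x_curr = rng.standard_normal(20), rng.standard_normal(20)
xi_t = sample_xi(A, b)
g_curr = grad_f(x_curr, xi_t)   # grad f(x_t,     xi_t)
g_prev = grad_f(x_prev, xi_t)   # grad f(x_{t-1}, xi_t), same xi_t
```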
4 Momentum and Variance Reduction

Before describing our algorithm in detail, we briefly explore the connection between SGD with momentum and variance reduction. Stochastic gradient descent with momentum is typically implemented as
$$d_t = (1-a)\, d_{t-1} + a \nabla f(x_t, \xi_t), \qquad x_{t+1} = x_t - \eta\, d_t,$$
where $a$ is small, e.g. $a = 0.1$. In words, instead of using the current gradient $\nabla F(x_t)$ in the update of $x_t$, we use an exponential average of the past observed gradients. While SGD with momentum and its variants have been successfully used in many machine learning applications [13], it is well known that the presence of noise in the stochastic gradients can nullify the theoretical gain of the momentum term [e.g. 29]. As a result, it is unclear how and why using momentum can be better than plain SGD. Although recent works have proved that a variant of SGD with momentum improves the non-dominant terms in the convergence rate on convex stochastic least-squares problems [6, 11], it is still unclear whether the actual convergence rate can be improved.

Here, we take a different route. Instead of showing that momentum in SGD works in the same way as in the noiseless case, i.e. giving accelerated rates, we show that a variant of momentum can provably reduce the variance of the gradients. In its simplest form, the variant we propose is:
$$d_t = (1-a)\, d_{t-1} + a \nabla f(x_t, \xi_t) + (1-a)\big(\nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t)\big) \qquad (2)$$
$$x_{t+1} = x_t - \eta\, d_t. \qquad (3)$$
The only difference is that we add the term $(1-a)(\nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t))$ to the update. As in standard variance-reduced methods, we use two gradients in each step. However, we do not need to use the gradient calculated at any checkpoint. Note that if $x_t \approx x_{t-1}$, then our update becomes approximately the momentum one. These two terms will be similar as long as the algorithm is actually converging to some point, and so we can expect the algorithm to behave exactly like the classic momentum SGD towards the end of the optimization process.
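Below is a minimal sketch of the simplest form of the update, equations (2)-(3), with a fixed momentum parameter $a$ and step size $\eta$. The oracle interface (a `grad_f(x, xi)` gradient oracle and a zero-argument `sample_xi()` sampler) and the default constants are illustrative placeholders, not recommendations from the paper.

```python
import numpy as np

def momentum_variance_reduction(x1, grad_f, sample_xi, a=0.1, eta=0.01, T=1000):
    """Updates (2)-(3): d_t = (1-a)*d_{t-1} + a*g_t + (1-a)*(g_t - g_prev_t)."""
    x = np.asarray(x1, dtype=float)
    x_prev, d = None, None
    for t in range(T):
        xi = sample_xi()
        g = grad_f(x, xi)                      # grad f(x_t, xi_t)
        if d is None:
            d = g                              # initialize d_1 with the first gradient
        else:
            g_prev = grad_f(x_prev, xi)        # grad f(x_{t-1}, xi_t): same xi_t, previous point
            d = (1 - a) * d + a * g + (1 - a) * (g - g_prev)   # eq. (2)
        x_prev = x
        x = x - eta * d                        # eq. (3)
    return x
```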
To understand why the above update delivers a variance reduction, consider the "error in $d_t$", which we denote by $\epsilon_t := d_t - \nabla F(x_t)$. This term measures the error we incur by using $d_t$ as the update direction instead of the correct but unknown direction $\nabla F(x_t)$. The equivalent term in SGD would be $\mathbb{E}[\|\nabla f(x_t, \xi_t) - \nabla F(x_t)\|^2] \le \sigma^2$. So, if $\mathbb{E}[\|\epsilon_t\|^2]$ decreases over time, we have realized a variance-reduction effect. The technical result that we use to show this decrease is provided in Lemma 2, but let us take a moment here to appreciate why this should be expected intuitively. Considering the update written in (2), we can obtain a recursive expression for $\epsilon_t$ by subtracting $\nabla F(x_t)$ from both sides:
$$\epsilon_t = (1-a)\epsilon_{t-1} + a\big(\nabla f(x_t, \xi_t) - \nabla F(x_t)\big) + (1-a)\big(\nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t) - (\nabla F(x_t) - \nabla F(x_{t-1}))\big).$$
Now, notice that there is good reason to expect the second and third terms of the RHS above to be small: we can control $a(\nabla f(x_t, \xi_t) - \nabla F(x_t))$ simply by choosing a small enough value of $a$, and from smoothness we expect $\nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t) - (\nabla F(x_t) - \nabla F(x_{t-1}))$ to be of the order of $O(\|x_t - x_{t-1}\|) = O(\eta\|d_{t-1}\|)$. Therefore, by choosing small enough $\eta$ and $a$, we obtain $\|\epsilon_t\| = (1-a)\|\epsilon_{t-1}\| + Z$, where $Z$ is some small value. Thus, intuitively, $\|\epsilon_t\|$ will decrease until it reaches $Z/a$. This highlights a trade-off in setting $\eta$ and $a$ in order to decrease the numerator of $Z/a$ while keeping the denominator sufficiently large. Our central challenge is showing that it is possible to achieve a favorable trade-off in which $Z/a$ is very small, resulting in a small error $\epsilon_t$.
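The "decrease until it reaches $Z/a$" step can be made explicit with a one-line fixed-point argument. The short derivation below is our own unrolling of the scalar recursion, treating $Z$ as a constant for illustration; it is not a statement from the paper.

```latex
% Scalar recursion r_t \le (1-a) r_{t-1} + Z with 0 < a \le 1, where r_t stands for \|\epsilon_t\|.
% Unrolling t-1 times and summing the geometric series:
\[
  r_t \;\le\; (1-a)^{t-1} r_1 \;+\; Z \sum_{i=0}^{t-2} (1-a)^i
      \;\le\; (1-a)^{t-1} r_1 \;+\; \frac{Z}{a}.
\]
% The first term decays geometrically, so r_t settles near the fixed point r^* of
% r^* = (1-a) r^* + Z, i.e. r^* = Z/a, which is exactly the trade-off discussed above:
% shrinking \eta and a shrinks Z but also shrinks the denominator a.
```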
5 STORM: STOchastic Recursive Momentum

We now describe our stochastic optimization algorithm, which we call STOchastic Recursive Momentum (STORM). The pseudocode is in Algorithm 1. As described in the previous section, its basic update is of the form of (2) and (3). However, in order to achieve adaptivity to the noise in the gradients, both the step size and the momentum term will depend on the past gradients, à la AdaGrad [7].

Algorithm 1 STORM: STOchastic Recursive Momentum
1: Input: parameters $k$, $w$, $c$, initial point $x_1$
2: Sample $\xi_1$
3: $G_1 \leftarrow \|\nabla f(x_1, \xi_1)\|$
4: $d_1 \leftarrow \nabla f(x_1, \xi_1)$
5: $\eta_0 \leftarrow k / w^{1/3}$
6: for $t = 1$ to $T$ do
7:   $\eta_t \leftarrow k / \big(w + \sum_{i=1}^t G_i^2\big)^{1/3}$
8:   $x_{t+1} \leftarrow x_t - \eta_t d_t$
9:   $a_{t+1} \leftarrow c \eta_t^2$
10:  Sample $\xi_{t+1}$
11:  $G_{t+1} \leftarrow \|\nabla f(x_{t+1}, \xi_{t+1})\|$
12:  $d_{t+1} \leftarrow \nabla f(x_{t+1}, \xi_{t+1}) + (1 - a_{t+1})\big(d_t - \nabla f(x_t, \xi_{t+1})\big)$
13: end for
14: Choose $\hat{x}$ uniformly at random from $x_1, \dots, x_T$. (In practice, set $\hat{x} = x_T$.)
15: return $\hat{x}$
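A direct Python transcription of Algorithm 1 is sketched below. It is an illustrative reading of the pseudocode rather than the authors' TensorFlow implementation (linked in Section 6), and `grad_f` / `sample_xi` are the same placeholder oracles as in the earlier sketches.

```python
import numpy as np

def storm(x1, grad_f, sample_xi, k, w, c, T):
    """Sketch of Algorithm 1 (STORM) with adaptive step size eta_t and momentum a_{t+1}."""
    x = np.asarray(x1, dtype=float)
    xi = sample_xi()
    g = grad_f(x, xi)
    G_sq_sum = float(np.dot(g, g))                 # running sum of G_i^2 = ||grad f(x_i, xi_i)||^2
    d = g                                          # d_1
    iterates = [x.copy()]
    for t in range(1, T + 1):
        eta = k / (w + G_sq_sum) ** (1.0 / 3.0)    # line 7
        x_next = x - eta * d                       # line 8
        a = c * eta ** 2                           # line 9
        xi = sample_xi()                           # line 10
        g_next = grad_f(x_next, xi)                # used for G_{t+1} and d_{t+1}
        G_sq_sum += float(np.dot(g_next, g_next))  # line 11
        d = g_next + (1 - a) * (d - grad_f(x, xi)) # line 12: same xi at x_t and x_{t+1}
        x = x_next
        iterates.append(x.copy())
    return iterates[np.random.randint(T)]          # line 14 (in practice, return iterates[-1])
```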
The convergence guarantee of STORM is presented in Theorem 1 below.

Theorem 1. Under the assumptions in Section 3, for any $b > 0$, write $k = \frac{b G^{2/3}}{L}$. Set $c = 28 L^2 + G^2/(7 L k^3) = L^2\left(28 + \frac{1}{7 b^3}\right)$ and $w = \max\left((4 L k)^3,\; 2 G^2,\; \left(\frac{c k}{4 L}\right)^3\right) = G^2 \max\left((4b)^3,\; 2,\; \left(28 b + \tfrac{1}{7 b^2}\right)^3/64\right)$. Then, STORM satisfies
$$\mathbb{E}[\|\nabla F(\hat{x})\|] = \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \|\nabla F(x_t)\|\right] \le \frac{w^{1/6}\sqrt{2M} + 2 M^{3/4}}{\sqrt{T}} + \frac{2\sigma^{1/3}}{T^{1/3}},$$
where $M = \frac{8(F(x_1) - F^\star)}{k} + \frac{w^{1/3}\sigma^2}{4 L^2 k^2} + \frac{k^2 c^2}{2 L^2}\ln(T+2)$.

In words, Theorem 1 guarantees that STORM will make the norm of the gradients converge to 0 at a rate of $O\!\left(\frac{\ln T}{\sqrt{T}}\right)$ if there is no noise, and in expectation at a rate of $\frac{2\sigma^{1/3}}{T^{1/3}}$ in the stochastic case. We remark that we achieve both rates automatically, without the need to know the noise level nor the need to tune step sizes. Note that the rate when $\sigma \ne 0$ matches the optimal rate [3], which was previously only obtained by SVRG-based algorithms that require a "mega-batch" [8, 31].

The dependence on $G$ in this bound deserves some discussion: at first blush it appears that if $G \to 0$, the bound will go to infinity because the denominators in $M$ go to zero. Fortunately, this is not so: the resolution is to observe that $F(x_1) - F^\star = O(G)$ and $\sigma = O(G)$, so that the numerators of $M$ actually go to zero at least as fast as the denominators. The dependence on $L$ may be similarly non-intuitive: as $L \to 0$, $M \to \infty$. In this case this is actually to be expected: if $L = 0$, then there are no critical points (because the gradients are all the same!) and so we cannot actually find one. In general, $M$ should be regarded as an $O(\log T)$ term where the constant indicates some inherent hardness level of the problem.

Finally, note that here we assumed that each $f(x, \xi)$ is $G$-Lipschitz in $x$. Prior variance-reduction results (e.g. [18, 8, 25]) do not make use of this assumption. However, we show in Appendix B that simply replacing all instances of $G$ or $G_t$ in the parameters of STORM with an oracle-tuned value of $\sigma$ allows us to dispense with this assumption while still avoiding all checkpoint gradients. Also note that, as in similar work on stochastic minimization of non-convex functions, Theorem 1 only bounds the gradient of a randomly selected iterate [9]. However, in practical implementations we expect the last iterate to perform equally well.

Our analysis formalizes the intuition developed in the previous section through a Lyapunov potential function. Our Lyapunov function is somewhat non-standard: for smooth non-convex functions, the Lyapunov function is typically of the form $\Phi_t = F(x_t)$, but we propose to use the function $\Phi_t = F(x_t) + z_t\|\epsilon_t\|^2$ for a time-varying $z_t \propto \eta_{t-1}^{-1}$, where $\epsilon_t$ is the error in the update introduced in the previous section. The use of a time-varying $z_t$ appears to be critical for us to avoid using any checkpoints: with constant $z_t$ it seems that one always needs at least one checkpoint gradient. Potential functions of this form have been used to analyze momentum algorithms in order to prove asymptotic guarantees; see, e.g., Ruszczynski and Syski [23]. However, as far as we know, this use of a potential is somewhat different from most variance-reduction analyses, and so may provide avenues for further development.

We now proceed to the proof of Theorem 1.
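As a quick sanity check on the parameter choices in Theorem 1 (our own arithmetic, not from the paper), taking $b = 1$ gives concrete constants and confirms the step-size condition $\eta_t \le 1/(4L)$ used throughout the proof.

```latex
% Worked example of Theorem 1's parameters for b = 1 (illustrative arithmetic only):
%   k = G^{2/3}/L,
%   c = L^2 (28 + 1/7) \approx 28.14\, L^2,
%   w = G^2 \max\big( 64,\; 2,\; (28 + 1/7)^3/64 \big) \approx 348.3\, G^2,
% since (28.14)^3/64 \approx 22290/64 \approx 348.3 > 64.
% The initial step size is then
\[
  \eta_0 = \frac{k}{w^{1/3}} \approx \frac{G^{2/3}/L}{7.03\, G^{2/3}} \approx \frac{0.14}{L} < \frac{1}{4L},
\]
% and \eta_t only decreases, so the requirement \eta_t \le 1/(4L) of Lemma 1 is satisfied.
```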
5.1 Proof of Theorem 1

First, we consider a generic SGD-style analysis. Most SGD analyses assume that the gradient estimates used by the algorithm are unbiased estimates of $\nabla F(x_t)$, but unfortunately $d_t$ is biased. As a result, we need the following slightly different analysis. For lack of space, the proofs of this Lemma and the next one are in the Appendix.

Lemma 1. Suppose $\eta_t \le \frac{1}{4L}$ for all $t$. Then
$$\mathbb{E}[F(x_{t+1}) - F(x_t)] \le \mathbb{E}\left[-\frac{\eta_t}{4}\|\nabla F(x_t)\|^2 + \frac{3\eta_t}{4}\|\epsilon_t\|^2\right].$$

The following technical observation is key to our analysis of STORM: it provides a recurrence that enables us to bound the variance of the estimates $d_t$.

Lemma 2. With the notation of Algorithm 1, we have
$$\mathbb{E}\left[\frac{\|\epsilon_t\|^2}{\eta_{t-1}}\right] \le \mathbb{E}\left[2 c^2\eta_{t-1}^3 G_t^2 + (1-a_t)^2(1 + 4 L^2\eta_{t-1}^2)\frac{\|\epsilon_{t-1}\|^2}{\eta_{t-1}} + 4(1-a_t)^2 L^2\eta_{t-1}\|\nabla F(x_{t-1})\|^2\right].$$

Lemma 2 exhibits a somewhat involved algebraic identity, so let us try to build some intuition for what it means and how it can help us. First, multiply both sides by $\eta_{t-1}$. Technically the expectations make this a forbidden operation, but we ignore this detail for now. Next, observe that $\sum_{t=1}^T G_t^2$ is roughly $\Theta(T)$ (since the variance prevents $\|g_t\|^2$ from going to zero even when $\|\nabla F(x_t)\|$ does). Therefore $\eta_t$ is roughly $O(1/t^{1/3})$, and $a_t$ is roughly $O(1/t^{2/3})$. Discarding all constants, and observing that $(1-a_t)^2 \le (1-a_t)$, the above Lemma is then saying that
$$\mathbb{E}[\|\epsilon_t\|^2] \le \mathbb{E}\left[\eta_{t-1}^4 + (1-a_t)\|\epsilon_{t-1}\|^2 + \eta_{t-1}^2\|\nabla F(x_{t-1})\|^2\right] = \mathbb{E}\left[t^{-4/3} + \left(1 - t^{-2/3}\right)\|\epsilon_{t-1}\|^2 + t^{-2/3}\|\nabla F(x_{t-1})\|^2\right].$$
We can use this recurrence to compute a kind of "equilibrium value" for $\mathbb{E}[\|\epsilon_t\|^2]$: set $\mathbb{E}[\|\epsilon_t\|^2] = \mathbb{E}[\|\epsilon_{t-1}\|^2]$ and solve to obtain that $\|\epsilon_t\|^2$ is $O(1/t^{2/3} + \|\nabla F(x_t)\|^2)$. This in turn suggests that, whenever $\|\nabla F(x_t)\|^2$ is greater than $1/t^{2/3}$, the gradient estimate $d_t = \nabla F(x_t) + \epsilon_t$ will be a very good approximation of $\nabla F(x_t)$, so that gradient descent should make very fast progress. Therefore, we expect the "equilibrium value" for $\|\nabla F(x_t)\|^2$ to be $O(1/T^{2/3})$, since this is the point at which the estimate $d_t$ becomes dominated by the error. We formalize this intuition using a Lyapunov function of the form $\Phi_t = F(x_t) + z_t\|\epsilon_t\|^2$ in the proof of Theorem 1 below.

Proof of Theorem 1. Consider the potential $\Phi_t = F(x_t) + \frac{1}{32 L^2\eta_{t-1}}\|\epsilon_t\|^2$. We will upper bound $\Phi_{t+1} - \Phi_t$ for each $t$, which will allow us to bound $\Phi_T$ in terms of $\Phi_1$ by summing over $t$.

First, observe that since $w \ge (4 L k)^3$, we have $\eta_t \le \frac{1}{4L}$. Further, since $a_{t+1} = c\eta_t^2$, we have $a_{t+1} \le \frac{c k}{4 L w^{1/3}} \le 1$ for all $t$.

Then, we first consider $\eta_t^{-1}\|\epsilon_{t+1}\|^2 - \eta_{t-1}^{-1}\|\epsilon_t\|^2$. Using Lemma 2, we obtain
$$\mathbb{E}\left[\eta_t^{-1}\|\epsilon_{t+1}\|^2 - \eta_{t-1}^{-1}\|\epsilon_t\|^2\right] \le \mathbb{E}\left[2 c^2\eta_t^3 G_{t+1}^2 + (1-a_{t+1})^2(1 + 4 L^2\eta_t^2)\frac{\|\epsilon_t\|^2}{\eta_t} + 4(1-a_{t+1})^2 L^2\eta_t\|\nabla F(x_t)\|^2 - \frac{\|\epsilon_t\|^2}{\eta_{t-1}}\right]$$
$$\le \mathbb{E}\Big[\underbrace{2 c^2\eta_t^3 G_{t+1}^2}_{A_t} + \underbrace{\left(\eta_t^{-1}(1-a_{t+1})(1 + 4 L^2\eta_t^2) - \eta_{t-1}^{-1}\right)\|\epsilon_t\|^2}_{B_t} + \underbrace{4 L^2\eta_t\|\nabla F(x_t)\|^2}_{C_t}\Big].$$

Let us focus on the terms of this expression individually. For the first term, $A_t$, observe that $w \ge 2 G^2 \ge G^2 + G_{t+1}^2$ to obtain:
$$\sum_{t=1}^T A_t = \sum_{t=1}^T 2 c^2\eta_t^3 G_{t+1}^2 = \sum_{t=1}^T \frac{2 k^3 c^2 G_{t+1}^2}{w + \sum_{i=1}^t G_i^2} \le \sum_{t=1}^T \frac{2 k^3 c^2 G_{t+1}^2}{G^2 + \sum_{i=1}^{t+1} G_i^2} \le 2 k^3 c^2\ln\left(1 + \frac{\sum_{t=1}^{T+1} G_t^2}{G^2}\right) \le 2 k^3 c^2\ln(T+2),$$
where in the second-to-last inequality we used Lemma 4 in the Appendix.

For the second term, $B_t$, we have
$$B_t \le \left(\eta_t^{-1} - \eta_{t-1}^{-1} + \eta_t^{-1}(4 L^2\eta_t^2 - a_{t+1})\right)\|\epsilon_t\|^2 = \left(\eta_t^{-1} - \eta_{t-1}^{-1} + \eta_t(4 L^2 - c)\right)\|\epsilon_t\|^2.$$
Let us focus on $\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}$ for a minute. Using the concavity of $x^{1/3}$, we have $(x+y)^{1/3} \le x^{1/3} + y x^{-2/3}/3$. Therefore:
$$\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} = \frac{1}{k}\left[\left(w + \sum_{i=1}^t G_i^2\right)^{1/3} - \left(w + \sum_{i=1}^{t-1} G_i^2\right)^{1/3}\right] \le \frac{G_t^2}{3k\left(w + \sum_{i=1}^{t-1} G_i^2\right)^{2/3}} \le \frac{G_t^2}{3k\left(w - G^2 + \sum_{i=1}^t G_i^2\right)^{2/3}}$$
$$\le \frac{G_t^2}{3k\left(w/2 + \sum_{i=1}^t G_i^2\right)^{2/3}} \le \frac{2^{2/3} G_t^2}{3k\left(w + \sum_{i=1}^t G_i^2\right)^{2/3}} = \frac{2^{2/3} G_t^2\eta_t^2}{3 k^3} \le \frac{2^{2/3} G^2\eta_t}{12 L k^3} \le \frac{G^2\eta_t}{7 L k^3},$$
where we have used that $w \ge (4 L k)^3$ so that $\eta_t \le \frac{1}{4L}$. Further, since $c = 28 L^2 + G^2/(7 L k^3)$, we have $\eta_t(4 L^2 - c) \le -24 L^2\eta_t - G^2\eta_t/(7 L k^3)$. Thus, we obtain $B_t \le -24 L^2\eta_t\|\epsilon_t\|^2$.

Putting all this together yields:
$$\frac{1}{32 L^2}\sum_{t=1}^T\left(\frac{\|\epsilon_{t+1}\|^2}{\eta_t} - \frac{\|\epsilon_t\|^2}{\eta_{t-1}}\right) \le \frac{k^3 c^2}{16 L^2}\ln(T+2) + \sum_{t=1}^T\left[\frac{\eta_t}{8}\|\nabla F(x_t)\|^2 - \frac{3\eta_t}{4}\|\epsilon_t\|^2\right]. \qquad (4)$$

Now, we are ready to analyze the potential $\Phi_t$. Since $\eta_t \le \frac{1}{4L}$, we can use Lemma 1 to obtain
$$\mathbb{E}[\Phi_{t+1} - \Phi_t] \le \mathbb{E}\left[-\frac{\eta_t}{4}\|\nabla F(x_t)\|^2 + \frac{3\eta_t}{4}\|\epsilon_t\|^2 + \frac{1}{32 L^2\eta_t}\|\epsilon_{t+1}\|^2 - \frac{1}{32 L^2\eta_{t-1}}\|\epsilon_t\|^2\right].$$
Summing over $t$ and using (4), we obtain
$$\mathbb{E}[\Phi_{T+1} - \Phi_1] \le \sum_{t=1}^T\mathbb{E}\left[-\frac{\eta_t}{4}\|\nabla F(x_t)\|^2 + \frac{3\eta_t}{4}\|\epsilon_t\|^2 + \frac{1}{32 L^2\eta_t}\|\epsilon_{t+1}\|^2 - \frac{1}{32 L^2\eta_{t-1}}\|\epsilon_t\|^2\right] \le \mathbb{E}\left[\frac{k^3 c^2}{16 L^2}\ln(T+2) - \sum_{t=1}^T\frac{\eta_t}{8}\|\nabla F(x_t)\|^2\right].$$
Reordering the terms, we have
$$\mathbb{E}\left[\sum_{t=1}^T\eta_t\|\nabla F(x_t)\|^2\right] \le \mathbb{E}\left[8(\Phi_1 - \Phi_{T+1}) + \frac{k^3 c^2}{2 L^2}\ln(T+2)\right] \le 8(F(x_1) - F^\star) + \frac{\mathbb{E}[\|\epsilon_1\|^2]}{4 L^2\eta_0} + \frac{k^3 c^2}{2 L^2}\ln(T+2) \le 8(F(x_1) - F^\star) + \frac{w^{1/3}\sigma^2}{4 L^2 k} + \frac{k^3 c^2}{2 L^2}\ln(T+2),$$
where the last inequality is given by the definitions of $d_1$ and $\eta_0$ in the algorithm.

Now, we relate $\mathbb{E}\left[\sum_{t=1}^T\eta_t\|\nabla F(x_t)\|^2\right]$ to $\mathbb{E}\left[\sum_{t=1}^T\|\nabla F(x_t)\|^2\right]$. First, since $\eta_t$ is decreasing,
$$\mathbb{E}\left[\sum_{t=1}^T\eta_t\|\nabla F(x_t)\|^2\right] \ge \mathbb{E}\left[\eta_T\sum_{t=1}^T\|\nabla F(x_t)\|^2\right].$$
Now, from the Cauchy-Schwarz inequality, for any random variables $A$ and $B$ we have $\mathbb{E}[A^2]\,\mathbb{E}[B^2] \ge \mathbb{E}[AB]^2$. Hence, setting $A = \sqrt{\eta_T\sum_{t=1}^T\|\nabla F(x_t)\|^2}$ and $B = \sqrt{1/\eta_T}$, we obtain
$$\mathbb{E}[1/\eta_T]\,\mathbb{E}\left[\eta_T\sum_{t=1}^T\|\nabla F(x_t)\|^2\right] \ge \mathbb{E}\left[\sqrt{\sum_{t=1}^T\|\nabla F(x_t)\|^2}\right]^2.$$
Therefore, if we set $M = \frac{1}{k}\left[8(F(x_1) - F^\star) + \frac{w^{1/3}\sigma^2}{4 L^2 k} + \frac{k^3 c^2}{2 L^2}\ln(T+2)\right]$, we get
$$\mathbb{E}\left[\sqrt{\sum_{t=1}^T\|\nabla F(x_t)\|^2}\right]^2 \le \mathbb{E}\left[\frac{8(F(x_1) - F^\star) + \frac{w^{1/3}\sigma^2}{4 L^2 k} + \frac{k^3 c^2}{2 L^2}\ln(T+2)}{\eta_T}\right] = \mathbb{E}\left[\frac{k M}{\eta_T}\right] \le \mathbb{E}\left[M\left(w + \sum_{t=1}^T G_t^2\right)^{1/3}\right].$$
Define $\zeta_t = \nabla f(x_t, \xi_t) - \nabla F(x_t)$, so that $\mathbb{E}[\|\zeta_t\|^2] \le \sigma^2$. Then we have $G_t^2 = \|\nabla F(x_t) + \zeta_t\|^2 \le 2\|\nabla F(x_t)\|^2 + 2\|\zeta_t\|^2$. Plugging this in and using $(a+b)^{1/3} \le a^{1/3} + b^{1/3}$, we obtain:
$$\mathbb{E}\left[\sqrt{\sum_{t=1}^T\|\nabla F(x_t)\|^2}\right]^2 \le \mathbb{E}\left[M\left(w + 2\sum_{t=1}^T\|\zeta_t\|^2\right)^{1/3} + 2^{1/3} M\left(\sum_{t=1}^T\|\nabla F(x_t)\|^2\right)^{1/3}\right] \le M(w + 2T\sigma^2)^{1/3} + 2^{1/3} M\left(\mathbb{E}\left[\sqrt{\sum_{t=1}^T\|\nabla F(x_t)\|^2}\right]\right)^{2/3},$$
where we have used the concavity of $x \mapsto x^a$ for all $a \le 1$ to move expectations inside the exponents.

Now, define $X = \sqrt{\sum_{t=1}^T\|\nabla F(x_t)\|^2}$. Then the above can be rewritten as
$$(\mathbb{E}[X])^2 \le M(w + 2T\sigma^2)^{1/3} + 2^{1/3} M(\mathbb{E}[X])^{2/3}.$$
Note that this implies that either $(\mathbb{E}[X])^2 \le 2 M(w + 2T\sigma^2)^{1/3}$, or $(\mathbb{E}[X])^2 \le 2\cdot 2^{1/3} M(\mathbb{E}[X])^{2/3}$. Solving for $\mathbb{E}[X]$ in these two cases, we obtain $\mathbb{E}[X] \le \sqrt{2M}(w + 2T\sigma^2)^{1/6} + 2 M^{3/4}$.

Finally, observe that by Cauchy-Schwarz we have $\sum_{t=1}^T\|\nabla F(x_t)\|/T \le X/\sqrt{T}$, so that
$$\mathbb{E}\left[\sum_{t=1}^T\frac{\|\nabla F(x_t)\|}{T}\right] \le \frac{\sqrt{2M}(w + 2T\sigma^2)^{1/6} + 2 M^{3/4}}{\sqrt{T}} \le \frac{w^{1/6}\sqrt{2M} + 2 M^{3/4}}{\sqrt{T}} + \frac{2\sigma^{1/3}}{T^{1/3}},$$
where we used $(a+b)^{1/3} \le a^{1/3} + b^{1/3}$ in the last inequality.
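Before turning to the experiments, a tiny simulation (ours, not the paper's) illustrates the variance-reduction effect that Lemma 2 formalizes. On a noisy quadratic with additive gradient noise the correction term in (2) is exact, so this is a particularly favorable toy case: the error $\|d_t - \nabla F(x_t)\|$ settles far below the $\approx \sigma\sqrt{d}$ error of the plain stochastic gradient. All constants here are arbitrary illustrative choices.

```python
import numpy as np

# Toy check of update (2) on F(x) = 0.5*||x||^2 with grad f(x, xi) = x + xi, xi ~ N(0, sigma^2 I).
rng = np.random.default_rng(1)
dim, sigma, a, eta, T = 10, 1.0, 0.05, 0.05, 2000

x, x_prev, d = rng.standard_normal(dim), None, None
sgd_err, storm_err = [], []
for t in range(T):
    xi = sigma * rng.standard_normal(dim)       # shared noise sample xi_t
    g = x + xi                                   # grad f(x_t, xi_t)
    sgd_err.append(np.linalg.norm(g - x))        # SGD error ||g_t - grad F(x_t)|| ~ sigma*sqrt(dim)
    if d is None:
        d = g
    else:
        g_prev = x_prev + xi                     # grad f(x_{t-1}, xi_t), same xi_t
        d = (1 - a) * d + a * g + (1 - a) * (g - g_prev)   # eq. (2)
    storm_err.append(np.linalg.norm(d - x))      # error of d_t against grad F(x_t) = x_t
    x_prev, x = x, x - eta * d                   # eq. (3)

print("mean SGD error    :", np.mean(sgd_err[-500:]))
print("mean eq. (2) error:", np.mean(storm_err[-500:]))
```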
6 Empirical Validation

[Figure 1: Experiments on CIFAR-10 with a ResNet-32 network, comparing Adam, AdaGrad, and STORM. (a) Train loss vs. iterations; (b) train accuracy vs. iterations; (c) test accuracy vs. iterations.]

In order to confirm that our advances do indeed yield an algorithm that performs well and requires little tuning, we implemented STORM in TensorFlow [1] and tested its performance on the CIFAR-10 image recognition benchmark [14] using a ResNet model [10], as implemented by the Tensor2Tensor package [26] (code available at https://github.com/google-research/google-research/tree/master/storm_optimizer). We compare STORM to AdaGrad and Adam, which are both very popular and successful optimization algorithms. The learning rates for AdaGrad and Adam were swept over a logarithmically spaced grid. For STORM, we set $w = k = 0.1$ as a default (these defaults were picked by tuning over a logarithmic grid on the much simpler MNIST dataset [15]; $w$ and $k$ were not tuned on CIFAR-10) and swept $c$ over a logarithmically spaced grid, so that all algorithms involved only one parameter to tune. No regularization was employed. We record train loss (cross-entropy), and accuracy on both the train and test sets (see Figure 1). These results show that, while STORM is only marginally better than AdaGrad on test accuracy, on both training loss and accuracy STORM appears to be somewhat faster in terms of number of iterations. We note that the convergence proof we provide actually only applies to the training loss (since we are making multiple passes over the dataset). We leave for the future whether appropriate regularization can trade off STORM's better training-loss performance to obtain better test performance.
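To connect this experimental protocol back to the earlier sketch of Algorithm 1, the snippet below shows how one might wire the `storm` function from that sketch into a tuning loop with the defaults reported here ($w = k = 0.1$) and a logarithmic sweep over $c$. The grid values, loop structure, and `loss` function are illustrative placeholders, not the paper's actual setup.

```python
import numpy as np

# Hypothetical tuning loop mirroring the protocol above: fix w = k = 0.1,
# sweep only c over a logarithmically spaced grid, and keep the best run.
# `storm`, `grad_f`, `sample_xi`, and `loss` are the placeholder functions
# from (or analogous to) the earlier sketches.
def tune_c(x1, grad_f, sample_xi, loss, T=10_000):
    best_c, best_loss = None, float("inf")
    for c in np.logspace(-2, 2, num=9):       # illustrative grid for c
        x_hat = storm(x1, grad_f, sample_xi, k=0.1, w=0.1, c=c, T=T)
        candidate_loss = loss(x_hat)
        if candidate_loss < best_loss:
            best_c, best_loss = c, candidate_loss
    return best_c, best_loss
```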
7 Conclusion

We have introduced a new variance-reduction-based algorithm, STORM, that finds critical points in stochastic, smooth, non-convex problems. Our algorithm improves upon prior algorithms by virtue of removing the need for checkpoint gradients and incorporating adaptive learning rates. These improvements mean that STORM is substantially easier to tune: it does not require choosing the size of the checkpoints, nor how often to compute the checkpoints (because there are no checkpoints), and by using adaptive learning rates the algorithm enjoys the same robustness to learning-rate tuning as popular algorithms like AdaGrad or Adam. STORM obtains the optimal convergence guarantee, adapting to the level of noise in the problem without knowledge of this parameter. We verified that on CIFAR-10 with a ResNet architecture, STORM indeed seems to be optimizing the objective in fewer iterations than baseline algorithms. Additionally, we point out that STORM's update formula is strikingly similar to the standard SGD-with-momentum heuristic employed in practice. To our knowledge, no theoretical result actually establishes an advantage of adding momentum to SGD in stochastic problems, creating an intriguing mystery. While our algorithm is not precisely the same as SGD with momentum, we feel that it provides strong intuitive evidence that momentum is performing some kind of variance reduction. We therefore hope that some of the analysis techniques used in this paper may provide a path towards explaining the advantages of momentum.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[2] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699-707, 2016.
[3] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
[4] A. Cutkosky and R. Busa-Fekete. Distributed stochastic optimization via adaptive SGD. In Advances in Neural Information Processing Systems, pages 1910-1919, 2018.
[5] A. Defazio and L. Bottou. On the ineffectiveness of variance reduced optimization for deep learning. arXiv preprint arXiv:1812.04529, 2018.
[6] A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 18(1):3520-3570, January 2017.
[7] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
[8] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689-699, 2018.
[9] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[11] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent for least squares regression. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 545-604. PMLR, 2018.
[12] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[14] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.
[16] X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In Proc. of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS, 2019.
[17] M. Mahdavi, L. Zhang, and R. Jin. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, pages 674-682, 2013.
[18] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proc. of the 34th International Conference on Machine Learning - Volume 70, pages 2613-2621. JMLR.org, 2017.
[19] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017.
[20] L. M. Nguyen, K. Scheinberg, and M. Takáč. Inexact SARAH algorithm for stochastic optimization. arXiv preprint arXiv:1811.10105, 2018.
[21] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.
[22] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
[23] A. Ruszczynski and W. Syski. Stochastic approximation method with gradient averaging for unconstrained problems. IEEE Transactions on Automatic Control, 28(12):1097-1105, December 1983.
[24] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[25] Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920, 2019.
[26] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416, 2018.
[27] C. Wang, X. Chen, A. J. Smola, and E. P. Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems 26, pages 181-189. Curran Associates, Inc., 2013.
[28] R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
[29] K. Yuan, B. Ying, and A. H. Sayed. On the influence of momentum acceleration on online learning. Journal of Machine Learning Research, 17(192):1-66, 2016.
[30] L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980-988. Curran Associates, Inc., 2013.
[31] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3921-3932, 2018.