{"title": "SpiderBoost and Momentum: Faster Variance Reduction Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2406, "page_last": 2416, "abstract": "SARAH and SPIDER are two recently developed stochastic variance-reduced algorithms, and SPIDER has been shown to achieve a near-optimal first-order oracle complexity in smooth nonconvex optimization. However, SPIDER uses an accuracy-dependent stepsize that slows down the convergence in practice, and cannot handle objective functions that involve nonsmooth regularizers. In this paper, we propose SpiderBoost as an improved scheme, which allows to use a much larger constant-level stepsize while maintaining the same near-optimal oracle complexity, and can be extended with proximal mapping to handle composite optimization (which is nonsmooth and nonconvex) with provable convergence guarantee. In particular, we show that proximal SpiderBoost achieves an oracle complexity of  O(min{n^{1/2}\\epsilon^{-2},\\epsilon^{-3}})  in composite nonconvex optimization, improving the state-of-the-art result by a factor of  O(min{n^{1/6},\\epsilon^{-1/3}}). We further develop a novel momentum scheme to accelerate SpiderBoost for composite optimization, which achieves the near-optimal oracle complexity in theory and substantial improvement in experiments.", "full_text": "SpiderBoostandMomentum:FasterStochasticVarianceReductionAlgorithmsZheWangDepartmentofECETheOhioStateUniversitywang.10982@osu.eduKaiyiJiDepartmentofECETheOhioStateUniversityji.367@osu.eduYiZhouDepartmentofECETheUniversityofUtahyi.zhou@utah.eduYingbinLiangDepartmentofECETheOhioStateUniversityliang.889@osu.eduVahidTarokhDepartmentofECEDukeUniversityvahid.tarokh@duke.eduAbstractSARAHandSPIDERaretworecentlydevelopedstochasticvariance-reducedalgorithms,andSPIDERhasbeenshowntoachieveanear-optimal\ufb01rst-orderoraclecomplexityinsmoothnonconvexoptimization.However,SPIDERusesanaccuracy-dependentstepsizethatslowsdowntheconvergenceinpractice,andcannothandleobjectivefunctionsthatinvolvenonsmoothregularizers.Inthispaper,weproposeSpiderBoostasanimprovedscheme,whichallowstouseamuchlargerconstant-levelstepsizewhilemaintainingthesamenear-optimaloraclecomplexity,andcanbeextendedwithproximalmappingtohandlecompositeoptimization(whichisnonsmoothandnonconvex)withprovableconvergenceguarantee.Inparticular,weshowthatproximalSpiderBoostachievesanoraclecomplexityofO(min{n1/2\u0001\u22122,\u0001\u22123})incompositenonconvexoptimization,improvingthestate-of-the-artresultbyafactorofO(min{n1/6,\u0001\u22121/3}).WefurtherdevelopanovelmomentumschemetoaccelerateSpiderBoostforcompositeoptimization,whichachievesthenear-optimaloraclecomplexityintheoryandsubstantialimprovementinexperiments.1IntroductionWeconsiderthefollowing\ufb01nite-sumoptimizationproblemminx\u2208Rd\u03a8(x):=f(x),wheref(x):=1nnXi=1fi(x)(P)wherethefunctionfdenotesthetotallossonthetrainingsamplesandingeneralisnonconvex.Sincelarge-scalemachinelearningproblemscanhaveverylargesamplesizen,thefull-batchgradientdescentalgorithmhashighcomputationalcomplexity.Thus,variousstochasticgradientdescent(SGD)algorithmshavebeenproposed.Fornonconvexoptimization,thebasicSGDalgorithm,whichcalculatesthegradientofonedatasampleperiteration,hasbeenshowntoyieldanoverallstochastic\ufb01rst-orderoracle(SFO)complexity,i.e.,gradientcomplexity,ofO(\u0001\u22124)[9]toattaina\ufb01rst-orderstationarypoint\u00afxthatsatis\ufb01esEk\u2207f(\u00afx)k\u2264\u0001.SGDalgorithmswithdiminishingstep-size[9,5]orasuf\ufb01cientlylargebatchsize[37,11]werealsoproposedtogu
Furthermore, various variance reduction methods have been proposed, which construct more accurate stochastic gradient estimators than that of SGD, e.g., SAG [31], SAGA [7], and SVRG [16]. In particular, SAGA and SVRG have been shown to yield an overall SFO complexity of $O(n^{2/3}\epsilon^{-2})$ [29, 4]. Recently, [24, 25] proposed a variance reduction method called SARAH, where the gradient estimator is sequentially updated in the inner loop to improve the estimation accuracy. In particular, SARAH has been shown in [25] to achieve an overall $O(\epsilon^{-4})$ SFO complexity for nonconvex optimization. Another variance reduction method called SPIDER was proposed in [8], which uses the same gradient estimator as that of SARAH but adopts a normalized gradient update with a stepsize $\eta = O(\epsilon/L)$. [8] showed that SPIDER achieves an overall $O(\min\{n^{1/2}\epsilon^{-2}, \epsilon^{-3}\})$ SFO complexity, which was further shown to be optimal in the regime with $n \le O(\epsilon^{-4})$.

Though SPIDER is theoretically appealing, three important issues still require further attention.

First, SPIDER requires a very restrictive stepsize $\eta = O(\epsilon/L)$ to guarantee its convergence, which prevents SPIDER from making large progress even when doing so is possible. Relaxing such a condition appears difficult under its original convergence analysis framework.

• This paper proposes a more practical SpiderBoost algorithm, which allows a much larger stepsize $\eta = O(1/L)$ than SPIDER while retaining the same state-of-the-art complexity order as SPIDER (see Table 2 in the supplementary materials). This is due to the new convergence analysis idea that we develop, which analyzes the increments of the variables over each entire inner loop rather than over each inner-loop iteration, and hence yields a tighter bound and consequently a more relaxed stepsize requirement.

Second, the convergence analysis of SPIDER requires a very small per-iteration increment $\|x_{k+1} - x_k\| = O(\epsilon/L)$, which is difficult to guarantee if one attempts to generalize it to a proximal algorithm for solving the composite optimization problem (see Section 3) that possibly involves nonsmoothness. Hence, generalizing SPIDER to the proximal setting with a provable convergence guarantee is challenging.

• Our SpiderBoost has a natural generalization, i.e., the Prox-SpiderBoost algorithm, which can be applied to solve composite optimization problems. We show that Prox-SpiderBoost achieves an SFO complexity of $O(n^{1/2}\epsilon^{-2})$ and a proximal oracle (PO) complexity of $O(\epsilon^{-2})$, which improves the existing best results by a factor of $O(n^{1/6})$ (see Table 1).

Third, although SPIDER achieves the near-optimal oracle complexity in nonconvex optimization, its practical performance has been found [25, 8] to be hardly advantageous over SVRG. Therefore, it is of vital importance to exploit other algorithmic dimensions to further improve the practical performance of SPIDER, and momentum is such a promising perspective. However, the existing analysis of momentum for variance-reduced algorithms has been developed for SVRG only, in certain convex scenarios [27, 1, 2, 32] and under a local gradient dominance geometry in nonconvex optimization [19]. Therefore, it is not even clear whether a certain momentum scheme can be applied to SPIDER and yield the optimal gradient oracle complexity for general nonconvex optimization.

• This paper proposes a momentum scheme to accelerate Prox-SpiderBoost, named Prox-SpiderBoost-M, for composite optimization. We show that Prox-SpiderBoost-M achieves an oracle complexity order of $O(n + \sqrt{n}\epsilon^{-2})$, matching the complexity lower bound for nonconvex optimization. In contrast to the existing analysis for stochastic algorithms with momentum [10] for nonconvex optimization, our proof exploits the martingale structure of the gradient estimator to bound the variance term and its accumulation over the entire optimization path in a tight way under the momentum scheme.
Due to the space limitation, we relegate several other results to the supplementary materials, including the analysis of Prox-SpiderBoost under non-Euclidean geometry and the Polyak-Łojasiewicz condition, and the analysis of both Prox-SpiderBoost and Prox-SpiderBoost-M for online nonconvex composite optimization.

1.1 Related Work

Stochastic algorithms for smooth nonconvex optimization: The convergence analysis of SGD was studied in [11] for smooth nonconvex optimization. SGD with a diminishing stepsize and with a sufficiently large batch size was further studied in [11, 5, 37] to improve the performance. Various variance-reduced algorithms have been proposed and studied, including, e.g., SAG [31], SAGA [7], SVRG [16, 29, 4], SCSG [18], SNVRG [36], SARAH [24, 25, 26, 28], and SPIDER [8]. In particular, SPIDER has been shown in [8] to achieve the oracle complexity lower bound in a certain regime. Such an idea has also been extended to optimization over manifolds in [38, 35], zeroth-order optimization in [15], ADMM in [12], zeroth-order ADMM in [13], problems with nonsmooth nonconvex regularizers in [33], stochastic composite optimization in [34], noisy gradient descent in [21], and an adaptive batch size scheme in [14]. Our study here proposes the SpiderBoost algorithm, which substantially improves the stepsize of SPIDER while retaining the same performance guarantee, and performs much faster than SPIDER in practice.

Table 1: Comparison of SFO complexity and PO complexity for composite optimization.

Algorithm | Stepsize $\eta$ | Finite-Sum SFO | Finite-Sum PO | Finite-Sum/Online(1) SFO | Finite-Sum/Online(1) PO
ProxGD [11] | $O(L^{-1})$ | $O(n\epsilon^{-2})$ | $O(\epsilon^{-2})$ | N/A | N/A
ProxSGD [11] | $O(L^{-1})$ | N/A | N/A | $O(\epsilon^{-4})$ | $O(\epsilon^{-2})$
ProxSVRG/SAGA [30] | $O(L^{-1})$ | $O(n + n^{2/3}\epsilon^{-2})$ | $O(\epsilon^{-2})$ | N/A | N/A
Natasha1.5 [3] | $O(\epsilon^{2/3}L^{-2/3})$ | N/A | N/A | $O(\epsilon^{-3} + \epsilon^{-10/3})$ | $O(\epsilon^{-10/3})$
ProxSVRG+ [22] | $O(L^{-1})$ | $O(n + n^{2/3}\epsilon^{-2})$ | $O(\epsilon^{-2})$ | $O(\epsilon^{-10/3})$ | $O(\epsilon^{-2})$
Prox-SpiderBoost (this work) | $O(L^{-1})$ | $O(n + n^{1/2}\epsilon^{-2})$ | $O(\epsilon^{-2})$ | $O(\epsilon^{-2} + \epsilon^{-3})$ | $O(\epsilon^{-2})$

(1) The online setting refers to the case where the objective function takes the form of the expected value of the loss function over the data distribution. Such a method can also be applied to solve the finite-sum problem, and hence the SFO complexity in the last column is applicable to both the finite-sum and online problems. Thus, for algorithms that have SFO bounds available in both of the last two columns, the minimum of the two bounds provides the best bound for the finite-sum problem.

Stochastic algorithms for composite nonconvex optimization: Proximal SGD has been proposed and studied by [9, 10] to solve composite nonconvex optimization problems. Moreover, variance-reduced algorithms such as Prox-SVRG and Prox-SAGA [30], Natasha1.5 [3], and ProxSVRG+ [22] have also been proposed to further improve the performance. Our study proposes Prox-SpiderBoost, which outperforms all existing algorithms for composite nonconvex optimization in complexity order.

Momentum schemes for nonconvex optimization: For nonconvex optimization, [10] established the convergence of SGD with momentum to an $\epsilon$-first-order stationary point with an oracle complexity of $O(\epsilon^{-4})$. The convergence guarantee of SVRG with momentum has been explored under a certain local gradient dominance geometry in nonconvex optimization [19]. Here, we propose Prox-SpiderBoost-M, which achieves the complexity lower bound in a certain regime, and in practice substantially outperforms existing variance-reduced algorithms with momentum.

2 SpiderBoost for Nonconvex Optimization

2.1 SpiderBoost Algorithm

In this section, we introduce the SpiderBoost algorithm designed for the problem (P). In [24], a novel gradient estimator was introduced for reducing the variance. More specifically, consider a certain inner loop $\{x_k\}_{k=0}^{q-1}$. The initialization of the estimator is set to $v_0 = \nabla f(x_0)$. Then, for each subsequent iteration $k$, an index set $S$ is sampled and the corresponding estimator $v_k$ is constructed as
$$v_k = \frac{1}{|S|}\sum_{i \in S}\big[\nabla f_i(x_k) - \nabla f_i(x_{k-1}) + v_{k-1}\big]. \qquad (1)$$

It can be seen that the estimator in eq. (1) is constructed iteratively based on the information $x_{k-1}$ and $v_{k-1}$ obtained from the previous update. As a comparison, the SVRG estimator [16] is constructed based on the information at the initialization of that loop (i.e., replace $x_{k-1}$ and $v_{k-1}$ in eq. (1) with $x_0$ and $v_0$, respectively). Therefore, the estimator in eq. (1) utilizes fresher information and yields a more accurate estimate of the full gradient. The estimator in eq. (1) has been adopted by [24, 25] and [8] for proposing SARAH and SPIDER, respectively. Specifically, SPIDER was shown in [8] to be optimal in the regime with $n \le O(\epsilon^{-4})$.

Though SPIDER has the desired performance in theory, it can run very slowly in practice due to the choice of a conservative stepsize. Specifically, SPIDER uses a very small stepsize $\eta = O(\epsilon/L)$ (where $\epsilon$ is the desired accuracy) in normalized gradient descent, which yields a small increment per iteration, i.e., $\|x_{k+1} - x_k\| = O(\epsilon)$. By following the analysis of SPIDER, such a stepsize appears to be necessary in order to achieve the desired convergence rate.

Algorithm 1 SpiderBoost
Input: $\eta = \frac{1}{2L}$, $q$, $K$, $|S| \in \mathbb{N}$.
for $k = 0, 1, \ldots, K-1$ do
  if mod$(k, q) = 0$ then compute $v_k = \nabla f(x_k)$;
  else draw $|S|$ samples with replacement and compute $v_k$ according to eq. (1).
  $x_{k+1} = x_k - \eta v_k$.
end for
Output: $x_\xi$, where $\xi \sim \mathrm{Unif}\{0, \ldots, K-1\}$.

Algorithm 2 Prox-SpiderBoost
Input: $\eta = \frac{1}{2L}$, $q$, $K$, $|S| \in \mathbb{N}$.
for $k = 0, 1, \ldots, K-1$ do
  if mod$(k, q) = 0$ then compute $v_k = \nabla f(x_k)$;
  else draw $|S|$ samples with replacement and compute $v_k$ according to eq. (1).
  $x_{k+1} = \mathrm{prox}_{\eta h}(x_k - \eta v_k)$.
end for
Output: $x_\xi$, where $\xi \sim \mathrm{Unif}\{0, \ldots, K-1\}$.

Such a conservative stepsize adopted by SPIDER motivates our design of an improved algorithm named SpiderBoost (see Algorithm 1), which uses the same estimator in eq. (1) as SARAH and SPIDER, but adopts a much larger stepsize $\eta = \frac{1}{2L}$, as opposed to the $\eta = O(\epsilon/L)$ taken by SPIDER. Also, SpiderBoost updates the variable via a gradient descent step (same as SARAH), as opposed to the normalized gradient descent step taken by SPIDER. Furthermore, SpiderBoost generates the output variable via a random strategy, whereas SPIDER outputs deterministically. Collectively, SpiderBoost can make considerably larger progress per iteration than SPIDER, especially in the initial optimization phase where the estimated gradient norm $\|v_k\|$ is large, and is still guaranteed to achieve the same desirable convergence rate as SPIDER, as we show in the next subsection. We compare the empirical performance of SPIDER and SpiderBoost in Section 5.1.
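For concreteness, the following NumPy sketch (an illustration added here, not code from the paper) implements Algorithm 1 with the recursive estimator in eq. (1); grad_full and grad_i are assumed user-supplied oracles for $\nabla f$ and $\nabla f_i$.

import numpy as np

def spiderboost(grad_full, grad_i, x0, n, L, K, seed=0):
    # SpiderBoost (Algorithm 1) with q = |S| = sqrt(n) and stepsize eta = 1/(2L).
    rng = np.random.default_rng(seed)
    q = batch = max(int(np.sqrt(n)), 1)
    eta = 1.0 / (2.0 * L)
    x = x0.astype(float)
    iterates = []
    for k in range(K):
        iterates.append(x.copy())                        # store x_k for the randomized output
        if k % q == 0:
            v = grad_full(x)                             # full gradient at the start of each epoch
        else:
            idx = rng.integers(0, n, size=batch)         # draw |S| samples with replacement
            # eq. (1): v_k = (1/|S|) sum_{i in S} [grad f_i(x_k) - grad f_i(x_{k-1}) + v_{k-1}]
            v = np.mean([grad_i(i, x) - grad_i(i, x_prev) for i in idx], axis=0) + v
        x_prev = x
        x = x - eta * v                                  # plain (un-normalized) gradient step
    return iterates[rng.integers(0, K)]                  # output x_xi, xi ~ Unif{0, ..., K-1}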
2.2 Convergence Analysis of SpiderBoost

In this subsection, we study the convergence rate and complexity of SpiderBoost. In particular, we adopt the following standard assumptions.

Assumption 1. The objective function in the problem (P) satisfies:
1. The objective function $\Psi$ is bounded below, i.e., $\Psi^* := \inf_{x \in \mathbb{R}^d} \Psi(x) > -\infty$;
2. Each gradient $\nabla f_i$, $i = 1, \ldots, n$, is $L$-Lipschitz continuous, i.e., $\forall x, y \in \mathbb{R}^d$, $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$.

Assumption 1 essentially assumes that the smooth objective function has a non-trivial minimum and its gradient is Lipschitz continuous, which are valid and standard conditions in nonconvex optimization. Then, we obtain the following convergence result for SpiderBoost.

Theorem 1. Let Assumption 1 hold and apply SpiderBoost in Algorithm 1 to solve the problem (P) with parameters $q = |S| = \sqrt{n}$ and stepsize $\eta = \frac{1}{2L}$. Then, the corresponding output $x_\xi$ satisfies $\mathbb{E}\|\nabla f(x_\xi)\| \le \epsilon$ provided that the total number $K$ of iterations satisfies $K \ge O\big(\frac{L(f(x_0) - f^*)}{\epsilon^2}\big)$. Moreover, the overall SFO complexity is $O(\sqrt{n}\epsilon^{-2} + n)$.

Theorem 1 shows that the output of SpiderBoost achieves the first-order stationary condition within $\epsilon$ accuracy with a total SFO complexity of $O(\sqrt{n}\epsilon^{-2} + n)$. This matches the lower bound that one can expect for first-order algorithms in the regime $n \le O(\epsilon^{-4})$ [8]. As we explain in Section 2.1, SpiderBoost enhances SPIDER mainly due to the utilization of a large constant stepsize, which yields significant acceleration over SPIDER in practice, as we illustrate in the experiments in Section 5.1. We note that the analysis of SpiderBoost in Theorem 1 is very different from that of SPIDER, which depends on an $\epsilon$-level stepsize and the normalized gradient descent step to guarantee a constant increment $\|x_{k+1} - x_k\|$ in every iteration. In contrast, SpiderBoost exploits the special structure of the gradient estimator and analyzes the algorithm over the entire inner loop rather than over each iteration, and thus yields a better bound.

3 Prox-SpiderBoost for Nonconvex Composite Optimization

In this section, we generalize SpiderBoost to solve the following nonconvex composite problem:
$$\min_{x \in X} \Psi(x) := f(x) + h(x), \quad f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (Q)$$
where the function $f$ is possibly nonconvex, $h$ is a simple convex but possibly nonsmooth regularizer, and $X$ is a convex constraint set. To handle the nonsmoothness, we next introduce the proximal mapping, which is an effective tool for composite optimization.

3.1 Preliminaries on the Proximal Mapping

Consider a proper and lower-semicontinuous function $h$ (which can be non-differentiable). We define its proximal mapping at $x \in \mathbb{R}^d$ with parameter $\eta > 0$ as
$$\mathrm{prox}_{\eta h}(x) := \arg\min_{u \in X}\Big\{h(u) + \frac{1}{2\eta}\|u - x\|^2\Big\}.$$
Such a mapping is well defined and, in particular, is unique for convex functions. Furthermore, the proximal mapping can be used to generalize the first-order stationary condition of smooth optimization to nonsmooth composite optimization via the following fact.

Fact 1. Let $h$ be a proper and convex function. Define the following notion of generalized gradient
$$G_\eta(x) := \frac{1}{\eta}\Big(x - \mathrm{prox}_{\eta h}\big(x - \eta\nabla f(x)\big)\Big). \qquad (2)$$
Then, $x$ is a critical point of $\Psi := f + h$ (i.e., $0 \in \nabla f(x) + \partial h(x)$) if and only if $G_\eta(x) = 0$.

Fact 1 introduces a generalized notion of gradient for composite optimization. To elaborate, consider the case $h \equiv 0$, so that the proximal mapping becomes the identity mapping. Then, the generalized gradient $G_\eta(x)$ reduces to the gradient $\nabla f(x)$ of the unconstrained optimization. Therefore, the $\epsilon$-first-order stationary condition for composite optimization is naturally defined as $\|G_\eta(x)\| \le \epsilon$.
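As an illustration of these definitions (a sketch added here, not taken from the paper), consider the unconstrained case $X = \mathbb{R}^d$ with the $\ell_1$ regularizer $h(x) = \mu\|x\|_1$ used later in Section 5.2; its proximal mapping is the elementwise soft-thresholding operator, and the generalized gradient follows directly from eq. (2).

import numpy as np

def prox_l1(x, eta, mu):
    # prox_{eta * h}(x) for h(x) = mu * ||x||_1: elementwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - eta * mu, 0.0)

def generalized_gradient(x, grad_f, eta, mu):
    # G_eta(x) = (1/eta) * (x - prox_{eta * h}(x - eta * grad f(x))), as in eq. (2).
    return (x - prox_l1(x - eta * grad_f(x), eta, mu)) / eta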
3.2 Prox-SpiderBoost and Oracle Complexity

To generalize to composite optimization, SpiderBoost admits a natural extension, Prox-SpiderBoost, whereas SPIDER encounters challenges. The main reason is that SpiderBoost admits a constant stepsize and its convergence guarantee does not place any restriction on the per-iteration increment of the variable. However, the convergence of SPIDER requires the per-iteration increment of the variable to be at the $\epsilon$-level, which is challenging to satisfy under the nonlinear proximal operator in composite optimization. The detailed steps of Prox-SpiderBoost (which generalizes SpiderBoost to composite optimization objectives) are described in Algorithm 2. In particular, Prox-SpiderBoost updates the variable via a proximal gradient step to handle the possible nonsmoothness in composite optimization. We next characterize the oracle complexity of Prox-SpiderBoost for achieving the generalized $\epsilon$-first-order stationary condition.

Theorem 2. Let Assumption 1 hold and consider the problem (Q) with $X = \mathbb{R}^d$. Apply Prox-SpiderBoost in Algorithm 2 with parameters $q = |S| = \sqrt{n}$ and $\eta = \frac{1}{2L}$. Then, the corresponding output $x_\xi$ satisfies $\mathbb{E}\|G_\eta(x_\xi)\| \le \epsilon$ provided that the total number $K$ of iterations satisfies $K \ge O\big(\frac{L(\Psi(x_0) - \Psi^*)}{\epsilon^2}\big)$. Moreover, the SFO complexity is $O(\sqrt{n}\epsilon^{-2} + n)$, and the proximal oracle (PO) complexity is $O(\epsilon^{-2})$.

As a comparison, the SFO complexity $O(\sqrt{n}\epsilon^{-2} + n)$ of Prox-SpiderBoost in Theorem 2 improves the existing complexity result by a factor of $n^{1/6}$ [22]. Furthermore, the complexity lower bound for achieving the $\epsilon$-first-order stationary condition in un-regularized optimization [8] also serves as a lower bound for composite optimization (by considering the special case $h \equiv 0$). Therefore, the SFO complexity of our Prox-SpiderBoost matches the corresponding complexity lower bound in the regime with $n \le O(\epsilon^{-4})$, and is hence near-optimal. Moreover, our Prox-SpiderBoost still achieves the state-of-the-art convergence results under other settings such as online optimization, non-Euclidean geometry, and the Polyak-Łojasiewicz condition. Due to the space limitation, we relegate these results to Appendix D.

4 Accelerating Prox-SpiderBoost via Momentum

In this section, we propose a proximal SpiderBoost algorithm that incorporates a momentum scheme (referred to as Prox-SpiderBoost-M) for solving the composite problem (Q), and study its theoretical guarantee as well as its oracle complexity.

4.1 Algorithm Design

We present the detailed update rule of Prox-SpiderBoost-M in Algorithm 3.

Algorithm 3 Prox-SpiderBoost-M
Input: $q, K \in \mathbb{N}$, $\{\lambda_k\}_{k=1}^{K-1}, \{\beta_k\}_{k=1}^{K-1} > 0$, $y_0 = x_0 \in \mathbb{R}^d$, and set $\alpha_k = \frac{2}{\lceil k/q \rceil + 1}$.
for $k = 0, 1, \ldots, K-1$ do
  $z_k = (1 - \alpha_{k+1}) y_k + \alpha_{k+1} x_k$,
  if mod$(k, q) = 0$ then set $v_k = \nabla f(z_k)$;
  else draw a sample set $\xi_k$ with replacement and compute $v_k$ according to eq. (1).
  $x_{k+1} = \mathrm{prox}_{\lambda_k h}(x_k - \lambda_k v_k)$,
  $y_{k+1} = z_k - \frac{\beta_k}{\lambda_k} x_k + \frac{\beta_k}{\lambda_k}\mathrm{prox}_{\lambda_k h}(x_k - \lambda_k v_k)$.
end for
Output: $z_\zeta$, where $\zeta \sim \mathrm{Unif}\{0, \ldots, K-1\}$.

To elaborate on the algorithm design, note that Prox-SpiderBoost-M generates a tuple of variable sequences $\{x_k, y_k, z_k\}_k$ according to the momentum scheme. Specifically, the variables $x_k, y_k$ are updated via proximal gradient-like steps using the gradient estimate $v_k$ proposed for SARAH in [24, 25] and different stepsizes $\lambda_k, \beta_k$, respectively. Then, their convex combination with momentum coefficient $\alpha_{k+1}$ yields the variable $z_{k+1}$. Here, we choose a standard momentum coefficient scheduling that diminishes epochwise (see the expression for $\alpha_k$) for proving the convergence guarantee in nonconvex optimization. We also note that the two updates for $x_{k+1}$ and $y_{k+1}$ do not introduce extra computation overhead compared to a single update, since they both depend on the same proximal term.

We want to highlight the difference between our momentum scheme for Prox-SpiderBoost-M and the existing momentum scheme designs for proximal SGD in [10] and proximal SVRG in [1]. These works use the following proximal gradient steps for updating the variables $x_{k+1}$ and $y_{k+1}$:
$$x_{k+1} = \mathrm{prox}_{\lambda_k h}(x_k - \lambda_k v_k), \quad y_{k+1} = \mathrm{prox}_{\beta_k h}(z_k - \beta_k v_k). \qquad (3)$$
Note that eq. (3) uses different proximal updates that are based on $x_k$ and $z_k$, respectively. As a comparison, our momentum scheme in Algorithm 3 applies the same proximal gradient term $\mathrm{prox}_{\lambda_k h}(x_k - \lambda_k v_k)$ to update both variables $x_{k+1}$ and $y_{k+1}$, and therefore requires less computation. Moreover, our update for the variable $y_{k+1}$ is not a single proximal gradient update (as opposed to eq. (3)), and it couples with the variables $z_k$ and $x_k$. The momentum scheme introduced in [1] was not proven to have a convergence guarantee in nonconvex optimization. In the next subsection, we prove that our momentum scheme in Algorithm 3 has a provable convergence guarantee for nonconvex composite optimization with convex regularizers.
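To make the coupled update concrete, the following sketch (an illustration added here; estimator and prox are assumed user-supplied callables, e.g., prox could be lambda u, s: prox_l1(u, s, 0.1) using the helper above) spells out one iteration of Algorithm 3.

import numpy as np

def prox_spiderboost_m_step(x, y, k, q, lam, beta, estimator, prox):
    # One iteration of Prox-SpiderBoost-M (Algorithm 3).
    # estimator(z, k) must return v_k (the full gradient when k % q == 0, otherwise the
    # recursive estimate of eq. (1)); prox(u, step) is the proximal mapping of h.
    alpha_next = 2.0 / (np.ceil((k + 1) / q) + 1.0)   # epochwise-diminishing momentum coefficient
    z = (1.0 - alpha_next) * y + alpha_next * x       # z_k = (1 - alpha_{k+1}) y_k + alpha_{k+1} x_k
    v = estimator(z, k)                               # gradient estimate v_k
    p = prox(x - lam * v, lam)                        # shared term prox_{lam h}(x_k - lam v_k)
    x_next = p                                        # x_{k+1}
    y_next = z - (beta / lam) * (x - p)               # y_{k+1}: coupled update reusing the prox term
    return x_next, y_next, z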
4.2 Convergence and Complexity Analysis

In this subsection, we study the convergence guarantee of Prox-SpiderBoost-M for solving the problem (Q). We obtain the following main result.

Theorem 3. Let Assumption 1 hold. Apply Prox-SpiderBoost-M (see Algorithm 3) to solve the problem (Q) with parameters $q = |\xi_k| \equiv \sqrt{n}$, $\beta_k \equiv \frac{1}{8L}$, and $\lambda_k \in [\beta_k, (1 + \alpha_k)\beta_k]$. Then, the output $z_\zeta$ produced by the algorithm satisfies $\mathbb{E}\|G_{\lambda_\zeta}(z_\zeta, \nabla f(z_\zeta))\| \le \epsilon$ for any $\epsilon > 0$ provided that the total number $K$ of iterations satisfies
$$K \ge O\bigg(\frac{L(\Psi(x_0) - \Psi^*)}{\epsilon^2}\bigg). \qquad (4)$$
Moreover, the SFO complexity is at most $O(n + \sqrt{n}\epsilon^{-2})$ and the PO complexity is at most $O(\epsilon^{-2})$.

Theorem 3 establishes the convergence rate of Prox-SpiderBoost-M for satisfying the generalized first-order stationary condition and the corresponding oracle complexity. Specifically, the iteration complexity to achieve the generalized $\epsilon$-first-order stationary condition is of the order $O(\epsilon^{-2})$, which matches that of Prox-SpiderBoost. Furthermore, the corresponding SFO complexity $O(n + \sqrt{n}\epsilon^{-2})$ matches the lower bound for nonconvex optimization [8]. Therefore, Prox-SpiderBoost-M enjoys the same optimal convergence guarantee as Prox-SpiderBoost in nonconvex optimization, and it further benefits from the momentum scheme, which can lead to significant acceleration in practical applications (as we demonstrate via experiments in Section 5).

From a technical perspective, we highlight the following major new developments in the proof of Theorem 3 that are different from the proof for the basic stochastic gradient algorithm with momentum [10] for nonconvex optimization: 1) our proof exploits the martingale structure of the SPIDER estimate $v_k$, which allows us to bound the mean-square error term $\mathbb{E}\|\nabla f(z_k) - v_k\|^2$ in a tight way under the momentum scheme; in the traditional analysis of stochastic algorithms with momentum [10], such an error term corresponds to the variance of the stochastic estimator and is assumed to be bounded by a universal constant. 2) Our proof requires a very careful manipulation of the bounding strategy to handle the accumulation of the mean-square error $\mathbb{E}\|\nabla f(z_k) - v_k\|^2$ over the entire optimization path.
5 Experiments

5.1 Comparison between SpiderBoost and SPIDER

Figure 1 (training loss $f - f^*$ versus the number of epochs; curves for SPIDER and SpiderBoost): (a) and (b): logistic regression problem with a nonconvex regularizer on the a9a and w8a datasets; (c) and (d): robust linear regression problem with an $\ell_2$ regularizer on the a9a and w8a datasets.

In this subsection, we compare the performance of SPIDER and SpiderBoost for solving the logistic regression problem with a nonconvex regularizer and the nonconvex robust linear regression problem (see Appendix F for the forms of the objective functions). For each problem, we use two different datasets from LIBSVM [6]: the a9a dataset ($n = 32561$, $d = 123$) and the w8a dataset ($n = 49749$, $d = 300$). For both algorithms, we use the same parameter setting except for the stepsize. As specified in [8] for SPIDER, we set $\eta = 0.01$ (determined by a prescribed accuracy to guarantee convergence). On the other hand, SpiderBoost allows us to set $\eta = 0.05$. Figure 1 shows the convergence of the function value gap of both algorithms versus the number of passes taken over the data. It can be seen that SpiderBoost enjoys much faster convergence than SPIDER due to the allowance of a large stepsize. Furthermore, SPIDER oscillates around a certain level, which corresponds to the prescribed accuracy that determines the adopted stepsize $\eta = 0.01$. This implies that setting a larger stepsize for SPIDER would cause it to saturate and start to oscillate at a certain function value, which is undesired.

5.2 Comparison of SpiderBoost-Type Algorithms with Other Algorithms

In this subsection, we compare the performance of our SpiderBoost (for smooth problems), Prox-SpiderBoost (for composite problems), and Prox-SpiderBoost-M with other existing stochastic variance-reduced algorithms, including SVRG [16], Katyusha_ns [1], ASVRG [32], and RSAG [10]. We note that all algorithms use certain momentum schemes except for SVRG, SpiderBoost, and Prox-SpiderBoost. For all algorithms considered, we set their learning rates to 0.05. For each experiment, we initialize all the algorithms at the same point, which is generated randomly from the normal distribution. Also, we choose a fixed mini-batch size of 256 and set the epoch length $q$ to $2n/256$, such that all algorithms pass over the entire dataset twice in each epoch.

We first apply these algorithms to solve two smooth nonconvex problems: the logistic regression and robust linear regression problems, each with the a9a and w8a datasets, and report the experimental results in Figure 2. One can see from Figure 2 that our Prox-SpiderBoost-M achieves the best performance and significantly outperforms the other algorithms. Also, Katyusha_ns and ASVRG do not achieve much acceleration in such a nonconvex case, as these algorithms were originally developed to achieve acceleration for convex problems. This demonstrates that our design of Prox-SpiderBoost-M has a stable performance in nonconvex optimization as well as a provable theoretical guarantee. We note that the curve of SpiderBoost overlaps with that of SVRG, similarly to the results reported in other recent studies.
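The exact objective functions are given in Appendix F; purely as an illustrative stand-in (an assumption made here, not the paper's definition), a logistic loss with a smooth nonconvex regularizer of the form $\alpha\sum_j x_j^2/(1 + x_j^2)$, which is common in this line of work, can be set up as follows. The a9a and w8a datasets can be loaded, e.g., with sklearn.datasets.load_svmlight_file.

import numpy as np

def logistic_loss_nonconvex_reg(x, A, b, alpha=0.1):
    # Average logistic loss over (A, b) with labels b in {-1, +1}, plus a smooth nonconvex
    # regularizer alpha * sum_j x_j^2 / (1 + x_j^2). Illustrative only; the objectives actually
    # used in the paper are defined in its Appendix F.
    margins = -b * (A @ x)
    loss = np.mean(np.logaddexp(0.0, margins))     # (1/n) sum_i log(1 + exp(-b_i * a_i^T x))
    reg = alpha * np.sum(x**2 / (1.0 + x**2))
    return loss + reg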
Figure 2 (training loss $f - f^*$ versus the number of epochs; curves for SVRG, ASVRG, Katyusha_ns, RSAG, SpiderBoost, and SpiderBoost-M): (a) and (b): logistic regression with a nonconvex regularizer on the a9a and w8a datasets; (c) and (d): robust linear regression on the a9a and w8a datasets.

We further add an $\ell_1$ nonsmooth regularizer with weight coefficient 0.1 to the objective functions of the above two optimization problems, and apply the corresponding proximal versions of these algorithms to solve the resulting nonconvex composite optimization problems. All the results are presented in Figure 3. One can see that our Prox-SpiderBoost-M still significantly outperforms all the other algorithms in these nonsmooth and nonconvex scenarios. This demonstrates that our novel design of the coupled update for $\{y_k\}_k$ in the momentum scheme is efficient in the nonsmooth and nonconvex setting. Also, it turns out that Katyusha_ns and ASVRG suffer from slow convergence (they converge only at around 40 epochs). Together with the above experiments for smooth problems, this implies that their performance is not stable and may not be generally suitable for solving nonconvex problems.

Figure 3 (training loss $f - f^*$ versus the number of epochs; curves for ProxSVRG, ProxASVRG, ProxKatyusha_ns, ProxRSAG, ProxSpiderBoost, and Prox-SpiderBoost-M): (a) and (b): logistic regression with an $\ell_1$ nonsmooth regularizer on the a9a and w8a datasets; (c) and (d): robust linear regression with an $\ell_1$ nonsmooth regularizer on the a9a and w8a datasets.

6 Conclusion

In this paper, we proposed the SpiderBoost algorithm, which achieves the same near-optimal complexity performance as SPIDER but allows a much larger stepsize, and hence runs faster in practice than SPIDER. We then extended the proposed SpiderBoost to solve composite nonconvex optimization and proposed a momentum scheme to further accelerate the algorithm. For all these algorithms, we developed new techniques to characterize the performance bounds, all of which achieve the state-of-the-art. We anticipate that SpiderBoost has great potential to be applied to various other large-scale optimization problems.

Acknowledgments

The work of Z. Wang, K. Ji, and Y. Liang was supported in part by the U.S. National Science Foundation under the grants CCF-1761506, CCF-1909291, and CCF-1900145.

References

[1] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research (JMLR), 18(1):8194–8244, Jan. 2017.
[2] Z. Allen-Zhu. KatyushaX: Simple momentum method for stochastic sum-of-nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 80, pages 179–185, Jul 2018.
[3] Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 2675–2686, 2018.
[4] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proc. International Conference on Machine Learning (ICML), pages 699–707, 2016.
[5] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[6] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 2011.
[7] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1646–1654, 2014.
[8] C. Fang, C. J. Li, Z. Lin, and T. Zhang. Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018.
[9] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[10] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, Mar 2016.
[11] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, Jan 2016.
[12] F. Huang, S. Chen, and H. Huang. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), 2019.
[13] F. Huang, S. Gao, J. Pei, and H. Huang. Nonconvex zeroth-order stochastic ADMM methods with lower function query complexity. arXiv:1907.13463, 2019.
[14] K. Ji, Z. Wang, Y. Zhou, and Y. Liang. Faster stochastic algorithms via history-gradient aided batch size adaptation, 2019.
[15] K. Ji, Z. Wang, Y. Zhou, and Y. Liang. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 3100–3109, Jun 2019.
[16] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 315–323, 2013.
[17] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proc. Machine Learning and Knowledge Discovery in Databases, pages 795–811, 2016.
[18] L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 2348–2358, 2017.
[19] Q. Li, Y. Zhou, Y. Liang, and P. K. Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 70, pages 2111–2119, 2017.
[20] X. Li, S. Ling, T. Strohmer, and K. Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and Computational Harmonic Analysis, 2018.
[21] Z. Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. arXiv:1904.09265, Apr 2019.
[22] Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018.
[23] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 2014.
[24] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proc. International Conference on Machine Learning (ICML), volume 70, pages 2613–2621, 2017.
[25] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261, May 2017.
[26] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam. Finite-sum smooth optimization with SARAH. arXiv:1901.07648, Jan 2019.
[27] A. Nitanda. Accelerated stochastic gradient descent for minimizing finite sums. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51, pages 195–203, May 2016.
[28] N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv:1902.05679, Feb 2019.
[29] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), pages 314–323, 2016.
[30] S. J. Reddi, S. Sra, B. Poczos, and A. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1145–1153, 2016.
[31] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 2663–2671, 2012.
[32] F. Shang, L. Jiao, K. Zhou, J. Cheng, Y. Ren, and Y. Jin. ASVRG: Accelerated proximal SVRG. In Proc. Asian Conference on Machine Learning, volume 95, pages 815–830, 2018.
[33] Y. Xu, R. Jin, and T. Yang. Stochastic proximal gradient methods for non-smooth non-convex regularized problems. arXiv:1902.07672, Feb 2019.
[34] J. Zhang and L. Xiao. A stochastic composite gradient method with incremental variance reduction. arXiv:1906.10186, 2019.
[35] J. Zhang, H. Zhang, and S. Sra. R-SPIDER: A fast Riemannian stochastic optimization algorithm with curvature independent rate. arXiv:1811.04194, 2018.
[36] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018.
[37] P. Zhou, X. Yuan, and J. Feng. New insight into hybrid stochastic gradient descent: Beyond with-replacement sampling and convexity. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1242–1251, 2018.
[38] P. Zhou, X. Yuan, and J. Feng. Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
[39] Y. Zhou and Y. Liang. Characterization of gradient dominance and regularity conditions for neural networks. arXiv:1710.06910v2, Oct 2017.
[40] Y. Zhou, H. Zhang, and Y. Liang. Geometrical properties and accelerated gradient solvers of non-convex phase retrieval. In Proc. Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 331–335, 2016.", "award": [], "sourceid": 1410, "authors": [{"given_name": "Zhe", "family_name": "Wang", "institution": "Ohio State University"}, {"given_name": "Kaiyi", "family_name": "Ji", "institution": "The Ohio State University"}, {"given_name": "Yi", "family_name": "Zhou", "institution": "University of Utah"}, {"given_name": "Yingbin", "family_name": "Liang", "institution": "The Ohio State University"}, {"given_name": "Vahid", "family_name": "Tarokh", "institution": "Duke University"}]}