{"title": "Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data", "book": "Advances in Neural Information Processing Systems", "page_first": 8157, "page_last": 8166, "abstract": "Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset.", "full_text": "Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data\n\nYuanzhi Li, Computer Science Department, Stanford University, Stanford, CA 94305, yuanzhil@stanford.edu\n\nYingyu Liang, Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706, yliang@cs.wisc.edu\n\nAbstract\n\nNeural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset.\n\n1 Introduction\n\nNeural networks have achieved great success in many applications, but despit
e a recent increase of theoretical studies, much remains to be explained. For example, it is empirically observed that learning with stochastic gradient descent (SGD) in the overparameterized setting (i.e., learning a large network with a number of parameters larger than the number of training data points) does not lead to overfitting [24, 31]. Some recent studies use the low complexity of the learned solution to explain the generalization, but usually do not explain how SGD or its variants favor low complexity solutions (i.e., the inductive bias or implicit regularization) [3, 23]. It is also observed that overparameterization and proper random initialization can help the optimization [28, 12, 26, 18], but it is also not well understood why a particular initialization can improve learning. Moreover, most of the existing works trying to explain these phenomena in general rely on unrealistic assumptions about the data distribution, such as Gaussian-ness and/or linear separability [32, 25, 10, 17, 7].\n\nThis paper thus proposes to study the problem of learning a two-layer overparameterized neural network using SGD for classification, on data with a more realistic structure. In particular, the data in each class is a mixture of several components, and components from different classes are well separated in distance (but the components in each class can be close to each other). This is motivated by practical data. For example, on the dataset MNIST [15], each class corresponds to a digit and can have several components corresponding to different writing styles of the digit, and an image in it is a small perturbation of one of the components. On the other hand, images that belong to the same component are closer to each other than to an image of another digit. Analysis in this setting can then help understand how the structure of the practical data affects the optimization and generalization.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this setting, we prove that when the network is sufficiently overparameterized, SGD provably learns a network close to the random initialization and with a small generalization error. This result shows that in the overparameterized setting and when the data is well structured, though in principle the network can overfit, SGD w
ith random initialization introduces a strong inductive bias and leads to good generalization. Our result also shows that the overparameterization requirement and the learning time depend on the parameters inherent to the structure of the data but not on the ambient dimension of the data.\n\nMore importantly, the analysis to obtain the result also provides some interesting theoretical insights for various aspects of learning neural networks. It reveals that the success of learning crucially relies on overparameterization and random initialization. These two combined together lead to a tight coupling around the initialization between the SGD and another learning process that has a benign optimization landscape. This coupling, together with the structure of the data, allows SGD to find a solution that has a low generalization error, while still remaining in the aforementioned neighborhood of the initialization. Our work makes a step towards explaining how overparameterization and random initialization help optimization, and how the inductive bias and good generalization arise from the SGD dynamics on structured data. Some other more technical implications of our analysis will be discussed in later sections, such as the existence of a good solution close to the initialization, and the low-rankness of the weights learned. Complementary empirical studies on synthetic data and on the benchmark dataset MNIST provide positive support for the analysis and insights.\n\n2 Related Work\n\nGeneralization of neural networks. Empirical studies show interesting phenomena about the generalization of neural networks: practical neural networks have the capacity to fit random labels of the training data, yet they still have good generalization when trained on practical data [24, 31, 2]. These networks are overparameterized in that they have more parameters than statistically necessary, and their good generalization cannot be explained by na\u00efvely applying traditional theory. Several lines of work have proposed certain low complexity measures of the learned network and derived generalization bounds to better explain the phenomena. [3, 23, 21] proved spectrally-normalized margin-based generalization bounds, [9, 23] derived bounds from a PAC-Bayes approach, and [1, 33, 4] derived bounds from the compression point of vie
w. They, in general, do not address why the low complexity arises. This paper takes a step towards this direction, though on two-layer networks and a simplified model of the data.\n\nOverparameterization and implicit regularization. The training objectives of overparameterized networks in principle have many (approximate) global optima and some generalize better than the others [14, 8, 2], while empirical observations imply that the optimization process in practice prefers those with better generalization. It is then an interesting question how this implicit regularization or inductive bias arises from the optimization and the structure of the data. Recent studies are on SGD for different tasks, such as logistic regression [27] and matrix factorization [11, 19, 16]. More related to our work is [7], which studies the problem of learning a two-layer overparameterized network on linearly separable data and shows that SGD converges to a global optimum with good generalization. Our work studies the problem on data with a well clustered (and potentially not linearly separable) structure that we believe is closer to practical scenarios and thus can advance this line of research.\n\nTheoretical analysis of learning neural networks. There also exists a large body of work that analyzes the optimization landscape of learning neural networks [13, 26, 30, 10, 25, 29, 6, 32, 17, 5]. They in general need to make unrealistic assumptions about the data such as Gaussian-ness, and/or have strong assumptions about the network such as using only linear activation. They also do not study the implicit regularization by the optimization algorithms.\n\n3 Problem Setup\n\nIn this work, a two-layer neural network with ReLU activation for k-class classification is given by $f = (f_1, f_2, \\cdots, f_k)$ such that for each $i \\in [k]$: $f_i(x) = \\sum_{r=1}^{m} a_{i,r} \\mathrm{ReLU}(\\langle w_r, x \\rangle)$, where $\\{w_r \\in \\mathbb{R}^d\\}$ are the weights for the $m$ neurons in the hidden layer, $\\{a_{i,r} \\in \\mathbb{R}\\}$ are the weights of the top layer, and $\\mathrm{ReLU}(z) = \\max\\{0, z\\}$.\n\nAssumptions about the data. The data is generated from a distribution $\\mathcal{D}$ as follows. There are $k \\times l$ unknown distributions $\\{\\mathcal{D}_{i,j}\\}_{i \\in [k], j \\in [l]}$ over $\\mathbb{R}^d$ and probabilities $p_{i,j} \\ge 0$ such that $\\sum_{i,j} p_{i,j} = 1$. Each data point $(x, y)$ is i.i.d. generated by: (1) Sample $z \\in [k] \\times [l]$ such that $\\Pr[z = (i,j)] = p_{i,j}$; (2) Set label y
= z[0], and sample $x$ from $\\mathcal{D}_z$. Assume we sample $N$ points $\\{(x_i, y_i)\\}_{i=1}^{N}$. Let us define the support of a distribution $\\mathcal{D}$ with density $p$ over $\\mathbb{R}^d$ as $\\mathrm{supp}(\\mathcal{D}) = \\{x : p(x) > 0\\}$, the distance between two sets $S_1, S_2 \\subseteq \\mathbb{R}^d$ as $\\mathrm{dist}(S_1, S_2) = \\min_{x \\in S_1, y \\in S_2} \\{\\|x - y\\|_2\\}$, and the diameter of a set $S_1 \\subseteq \\mathbb{R}^d$ as $\\mathrm{diam}(S_1) = \\max_{x, y \\in S_1} \\{\\|x - y\\|_2\\}$. Then we are ready to make the assumptions about the data.\n\n(A1) (Separability) There exists $\\delta > 0$ such that for every $i_1 \\ne i_2 \\in [k]$ and every $j_1, j_2 \\in [l]$, $\\mathrm{dist}(\\mathrm{supp}(\\mathcal{D}_{i_1,j_1}), \\mathrm{supp}(\\mathcal{D}_{i_2,j_2})) \\ge \\delta$. Moreover (see footnote 1), for every $i \\in [k], j \\in [l]$, $\\mathrm{diam}(\\mathrm{supp}(\\mathcal{D}_{i,j})) \\le \\lambda \\delta$, for $\\lambda \\le 1/(8l)$.\n\n(A2) (Normalization) Any $x$ from the distribution has $\\|x\\|_2 = 1$.\n\nA few remarks are in order. Instead of having one distribution for one class, we allow an arbitrary $l \\ge 1$ distributions in each class, which we believe is a better fit to the real data. For example, in MNIST, a class can be the number 1, and $l$ can be the different styles of writing 1 (1 or | or /). Assumption (A2) is for simplicity, while (A1) is our key assumption. With $l \\ge 1$ distributions inside each class, our assumption allows data that is not linearly separable, e.g., XOR type data in $\\mathbb{R}^2$ where there are two classes, one consisting of two balls of diameter 1/10 with centers (0, 0) and (2, 2) and the other consisting of two of the same diameter with centers (0, 2) and (2, 0). See Figure 3 in Appendix C for an illustration. Moreover, essentially the only assumption we have here is $\\lambda = O(1/l)$. When $l = 1$, $\\lambda = O(1)$, which is the minimal requirement on the order of $\\lambda$ for the distribution to be efficiently learnable. Our work allows larger $l$, so that the data can be more complicated inside each class. In this case, we require the separation to also be higher. When we increase $l$ to refine the distributions inside each class, we should expect the diameter of each distribution to become smaller as well. As long as the diameter of each distribution decreases at a rate greater than the total number of distributions, our assumption will hold.\n\nAssumptions about the learning process. We will only learn the weights $w_r$ to simplify the analysis. Since the ReLU activation is positive homogeneous, the effect of overparameterization can still be studied, and a similar approach has been adopted in previous work [7]. S
o the network is also written as $y = f(x, w) = (f_1(x, w), \\cdots, f_k(x, w))$ for $w = (w_1, \\cdots, w_m)$. We assume the learning is from a random initialization:\n\n(A3) (Random initialization) $w_r^{(0)} \\sim \\mathcal{N}(0, \\sigma^2 I)$, $a_{i,r} \\sim \\mathcal{N}(0, 1)$, with $\\sigma = \\frac{1}{m^{1/2}}$.\n\nThe learning process minimizes the cross entropy loss over the softmax, defined as: $L(w) = -\\frac{1}{N} \\sum_{s=1}^{N} \\log o_{y_s}(x_s, w)$, where $o_y(x, w) = \\frac{e^{f_y(x, w)}}{\\sum_{i=1}^{k} e^{f_i(x, w)}}$. Let $L(w, x_s, y_s) = -\\log o_{y_s}(x_s, w)$ denote the cross entropy loss for a particular point $(x_s, y_s)$. We consider a minibatch SGD of batch size $B$, number of iterations $T = N/B$ and learning rate $\\eta$ as the following process: Randomly divide the total training examples into $T$ batches, each of size $B$. Let the indices of the examples in the $t$-th batch be $\\mathcal{B}_t$. At each iteration, the update is (see footnote 2) $w_r^{(t+1)} = w_r^{(t)} - \\eta \\frac{1}{B} \\sum_{s \\in \\mathcal{B}_t} \\frac{\\partial L(w^{(t)}, x_s, y_s)}{\\partial w_r^{(t)}}, \\forall r \\in [m]$, where $\\frac{\\partial L(w, x_s, y_s)}{\\partial w_r} = \\left( \\sum_{i \\ne y_s} a_{i,r} o_i(x_s, w) - \\sum_{i \\ne y_s} a_{y_s,r} o_i(x_s, w) \\right) 1_{\\langle w_r, x_s \\rangle \\ge 0} \\, x_s. \\quad (1)$\n\nFootnote 1: The assumption $1/(8l)$ can be relaxed to $1/[(1+\\alpha) l]$ for any $\\alpha > 0$ by paying a large polynomial in $1/\\alpha$ in the sample complexity. We will not prove it in this paper because we would like to highlight the key factors.\n\nFootnote 2: Strictly speaking, $L(w, x_s, y_s)$ does not have a gradient everywhere due to the non-smoothness of ReLU. One can view $\\frac{\\partial L(w, x_s, y_s)}{\\partial w_r}$ as a convenient notation for the right hand side of (1).\n\n4 Main Result\n\nFor notational simplicity, for a target error $\\varepsilon$ (to be specified later), with high probability (or w.h.p.) means with probability $1 - 1/\\mathrm{poly}(1/\\delta, k, l, m, 1/\\varepsilon)$ for a sufficiently large polynomial poly, and $\\tilde{O}$ hides factors of $\\mathrm{poly}(\\log 1/\\delta, \\log k, \\log l, \\log m, \\log 1/\\varepsilon)$.\n\nTheorem 4.1. Suppose the assumptions (A1) (A2) (A3) are satisfied. Then for every $\\varepsilon > 0$, there is $M = \\mathrm{poly}(k, l, 1/\\delta, 1/\\varepsilon)$ such that for every $m \\ge M$, after doing a minibatch SGD with batch size $B = \\mathrm{poly}(k, l, 1/\\delta, 1/\\varepsilon, \\log m)$ and learning rate $\\eta = \\frac{1}{m \\cdot \\mathrm{poly}(k, l, 1/\\delta, 1/\\varepsilon, \\log m)}$ for $T = \\mathrm{poly}(k, l, 1/\\delta, 1/\\varepsilon, \\log m)$ iterations, with high probability: $\\Pr_{(x,y) \\sim \\mathcal{D}} \\left[ \\forall j \\in [k], j \\ne y, f_y(x, w^{(T)}) > f_j(x, w^{(T)}) \\right] \\ge 1 - \\varepsilon$.\n\nOur theorem implies if the data satisfies our assum
ptions, and we parametrize the network properly, then we only need polynomial in $k, l, 1/\\delta$ many samples to achieve a good prediction error. This error is measured directly on the true distribution $\\mathcal{D}$, not merely on the input data used to train this network. Our result is also dimension free: There is no dependency on the underlying dimension $d$ of the data; the complexity is fully captured by $k, l, 1/\\delta$. Moreover, no matter how much the network is overparameterized, it will only increase the total iterations by factors of $\\log m$. So we can overparameterize by a sub-exponential amount without significantly increasing the complexity.\n\nFurthermore, we can always treat each input example as an individual distribution, thus $\\lambda$ is always zero. In this case, if we use batch size $B$ for $T$ iterations, we would have $l = N = BT$. Then our theorem indicates that as long as $m = \\mathrm{poly}(N, 1/\\delta_0)$, where $\\delta_0$ is the minimal distance between any two examples, we can actually fit arbitrary labels of the input data. However, since the total number of iterations only depends on $\\log m$, when $m = \\mathrm{poly}(N, 1/\\delta_0)$ but the input data is actually structured (with small $k, l$ and large $\\delta$), then SGD can actually achieve a small generalization error, even when the network has enough capacity to fit arbitrary labels of the training examples (and this fitting can also be done by SGD). Thus, we prove that SGD has a strong inductive bias on structured data: Instead of finding bad global optima that can fit arbitrary labels, it actually finds those with good generalization guarantees. This gives a more thorough explanation of the empirical observations in [24, 31].\n\n5 Intuition and Proof Sketch for A Simplified Case\n\nTo train a neural network with ReLU activations, there are two questions that need to be addressed:\n1. Why can SGD optimize the training loss, or even find a critical point? Since the underlying network is highly non-smooth, existing theorems do not give any finite convergence rate of SGD for training neural networks with ReLU activations.\n2. Why can the trained network generalize, even when the capacity is large enough to fit random labels of the input data? This is known as the inductive bias of SGD.\nThis work takes a step towards answering these two questions. We show that when the network is overparameterized, it becomes more \u201cpseudo smooth
\u201d, which makes it easier for SGD to minimize the training loss, and furthermore, it will not hurt the generalization error. Our proof is based on the following important observation: The more we overparameterize the network, the less likely the activation pattern for one neuron and one data point will change in a fixed number of iterations. This observation allows us to couple the gradient of the true neural network with a \u201cpseudo gradient\u201d where the activation pattern for each data point and each neuron is fixed. That is, when computing the \u201cpseudo gradient\u201d, for fixed $r, i$, whether the $r$-th hidden node is activated on the $i$-th data point $x_i$ will always be the same for different $t$. (But for fixed $t$, for different $r$ or $i$, the sign can be different.) We are able to prove that unless the generalization error is small, the \u201cpseudo gradient\u201d will always be large. Moreover, we show that the network is actually smooth, thus SGD can minimize the loss. We then show that when the number $m$ of hidden neurons increases, with a properly decreasing learning rate, the total number of iterations it takes to minimize the loss is roughly unchanged. However, the total number of iterations for which we can couple the true gradient with the pseudo one increases. Thus, there is a polynomially large $m$ so that we can couple these two gradients until the network reaches a small generalization error.\n\n5.1 A Simplified Case: No Variance\n\nHere we illustrate the proof sketch for a simplified case and Appendix A provides the proof. The proof for the general case is provided in Appendix B. In the simplified case, we further assume:\n\n(S) (No variance) Each $\\mathcal{D}_{a,b}$ is a single data point $(x_{a,b}, a)$, and also we are doing full batch gradient descent as opposed to the minibatch SGD.\n\nThen we reload the loss notation as $L(w) = \\sum_{a \\in [k], b \\in [l]} p_{a,b} L(w, x_{a,b}, a)$, and the gradient is $\\frac{\\partial L(w)}{\\partial w_r} = \\sum_{a \\in [k], b \\in [l]} p_{a,b} \\left( \\sum_{i \\ne a} a_{i,r} o_i(x_{a,b}, w) - \\sum_{i \\ne a} a_{a,r} o_i(x_{a,b}, w) \\right) 1_{\\langle w_r, x_{a,b} \\rangle \\ge 0} \\, x_{a,b}$. Following the intuition above, we define the pseudo gradient as $\\frac{\\tilde{\\partial} L(w)}{\\partial w_r} = \\sum_{a \\in [k], b \\in [l]} p_{a,b} \\left( \\sum_{i \\ne a} a_{i,r} o_i(x_{a,b}, w) - \\sum_{i \\ne a} a_{a,r} o_i(x_{a,b}, w) \\right) 1_{\\langle w_r^{(0)}, x_{a,b} \\rangle \\ge 0} \\, x_{a,b}$, where it uses $1_{\\langle w_r^{(0)}, x_{a,b} \\rangle \\ge 0}$ instead of $1_{\\langle w_r, x_{a,b} \\rangle \\ge 0}$ as in the true gradient. That is, the activation pattern is set to be that in the initialization. Intuitively, the pseudo gradient is similar to the gradient for a pseudo network $g$ (but not exactly the same), defined as $g_i(x, w) := \\sum_{r=1}^{m} a_{i,r} \\langle w_r, x \\rangle 1_{\\langle w_r^{(0)}, x \\rangle \\ge 0}$. Coupling the gradients is then similar to coupling the networks $f$ and $g$.\n\nFor simplicity, let $v_{a,a,b} := \\sum_{i \\ne a} o_i(x_{a,b}, w) = \\frac{\\sum_{i \\ne a} e^{f_i(x_{a,b}, w)}}{\\sum_{i=1}^{k} e^{f_i(x_{a,b}, w)}}$ and when $s \\ne a$, $v_{s,a,b} := -o_s(x_{a,b}, w) = -\\frac{e^{f_s(x_{a,b}, w)}}{\\sum_{i=1}^{k} e^{f_i(x_{a,b}, w)}}$. Roughly, if $v_{a,a,b}$ is small, then $f_a(x_{a,b}, w)$ is relatively larger compared to the other $f_i(x_{a,b}, w)$, so the classification error is small. We prove the following two main lemmas. The first says that at each iteration, the total number of hidden units whose gradient can be coupled with the pseudo one is quite large.\n\nLemma 5.1 (Coupling). W.h.p. over the random initialization, for every $\\tau > 0$, for every $t = \\tilde{O}\\left(\\frac{\\tau}{\\eta}\\right)$, we have that for at least $1 - \\frac{e \\tau k l}{\\sigma}$ fraction of $r \\in [m]$: $\\frac{\\partial L(w^{(t)})}{\\partial w_r} = \\frac{\\tilde{\\partial} L(w^{(t)})}{\\partial w_r}$.\n\nThe second lemma says that the pseudo gradient is large unless the error is small.\n\nLemma 5.2. For $m = \\tilde{\\Omega}\\left(\\frac{k^3 l^2}{\\delta}\\right)$, for every $\\{p_{a,b} v_{i,a,b}\\}_{i,a \\in [k], b \\in [l]} \\in [-v, v]$ (that depends on $w_r^{(0)}, a_{i,r}$, etc.) with $\\max \\{p_{a,b} v_{i,a,b}\\}_{i,a \\in [k], b \\in [l]} = v$, there exists at least $\\Omega\\left(\\frac{\\delta}{k l}\\right)$ fraction of $r \\in [m]$ such that $\\left\\| \\frac{\\tilde{\\partial} L(w)}{\\partial w_r} \\right\\|_2 = \\tilde{\\Omega}\\left(\\frac{v \\delta}{k l}\\right)$.\n\nWe now illustrate how to use these two lemmas to show the convergence for a small enough learning rate $\\eta$. For simplicity, let us assume that $kl/\\delta = O(1)$ and $\\varepsilon = o(1)$. Thus, by Lemma 5.2 we know that unless $v \\le \\varepsilon$, there are $\\Omega(1)$ fraction of $r$ such that $\\left\\| \\tilde{\\partial} L(w) / \\partial w_r \\right\\|_2 = \\Omega(\\varepsilon)$. Moreover, by Lemma 5.1 we know that we can pick $\\tau = \\Theta(\\sigma \\varepsilon)$ so $e \\tau / \\sigma = \\Theta(\\varepsilon)$, which implies that there are $\\Omega(1)$ fraction of $r$ such that $\\left\\| \\partial L(w) / \\partial w_r \\right\\|_2 = \\Omega(\\varepsilon)$ as well. For small enough learning rate $\\eta$, doing one step of gradient descent will thus decrease $L(w)$ by $\\Omega(\\eta m \\varepsilon^2)$, so it converges
in $t = O\\left(1 / \\eta m \\varepsilon^2\\right)$ iterations. In the end, we just need to make sure that $1 / \\eta m \\varepsilon^2 \\le O(\\tau / \\eta) = \\Theta(\\sigma \\varepsilon / \\eta)$ so we can always apply the coupling Lemma 5.1. By $\\sigma = \\tilde{O}(m^{-1/2})$ we know that this is true as long as $m \\ge \\mathrm{poly}(1/\\varepsilon)$. A small $v$ can be shown to lead to a small generalization error.\n\n6 Discussion of Insights from the Analysis\n\nOur analysis, though for learning two-layer networks on well structured data, also sheds some light upon learning neural networks in more general settings.\n\nGeneralization. Several lines of recent work explain the generalization phenomenon of overparameterized networks by low complexity of the learned networks, from the points of view of spectrally-normalized margins [3, 23, 21], compression [1, 33, 4], and PAC-Bayes [9, 23]. Our analysis has partially explained how SGD (with proper random initialization) on structured data leads to the low complexity from the compression and PAC-Bayes points of view. We have shown that in a neighborhood of the random initialization, w.h.p. the gradients are similar to those of another benign learning process, and thus SGD can reduce the error and reach a good solution while still in the neighborhood. The closeness to the initialization then means the weights (or more precisely the difference between the learned weights and the initialization) can be easily compressed. In fact, empirical observations have been made and connected to generalization in [22, 1]. Furthermore, [1] explicitly point out such a compression using a helper string (corresponding to the initialization in our setting). [1] also point out that the compression view can be regarded as a more explicit form of the PAC-Bayes view, and thus our intuition also applies to the latter.\n\nThe existence of a solution of a small generalization error near the initialization is itself not obvious. Intuitively, on structured data, the updates are structured signals spread out across the weights of the hidden neurons. Then for prediction, the randomly initialized part in the weights has strong cancellation, while the structured signal part in the weights collectively affects the output. Therefore, the latter can be much smaller than the former while the network can still give accurate predictions. In other words, there can be a solution n
ot far from the initialization with high probability.\n\nSome insight is provided on the low rank of the weights. More precisely, when the data are well clustered around a few patterns, the accumulated updates (difference between the learned weights and the initialization) should be approximately low rank, which can be seen from checking the SGD updates. However, when the difference is small compared to the initialization, the spectrum of the final weight matrix is dominated by that of the initialization and thus will tend to be closer to that of a random matrix. Again, such observations/intuitions have been made in the literature and connected to compression and generalization (e.g., [1]).\n\nImplicit regularization vs. structure of the data. Existing work has analyzed the implicit regularization of SGD on logistic regression [27], matrix factorization [11, 19, 16], and learning two-layer networks on linearly separable data [7]. Our setting and also the analysis techniques are novel compared to the existing work. One motivation to study structured data is to understand the role that structured data plays in the implicit regularization, i.e., the observation that the solution learned on less structured or even random data is further away from the initialization. Indeed, our analysis shows that when the network size is fixed (and sufficiently overparameterized), learning over poorly structured data (larger $k$ and $l$) needs more iterations and thus the solution can deviate more from the initialization and has higher complexity. An extreme and especially interesting case is when the network is overparameterized so that in principle it can fit the training data by viewing each point as a component while actually they come from structured distributions with a small number of components. In this case, we can show that it still learns a network with a small generalization error; see the more technical discussion in Section 4.\n\nWe also note that our analysis is under the assumption that the network is sufficiently overparameterized, i.e., $m$ is a sufficiently large polynomial of $k$, $l$ and other related parameters measuring the structure of the data. There could be the case that $m$ is smaller than this polynomial but is more than sufficient to fit the data, i.e., the network is still overparameterized. Tho
ugh in this case the analysis still provides useful insight, it does not fully apply; see our experiments with relatively small $m$. On the other hand, the empirical observations [24, 31] suggest that practical networks are highly overparameterized, so our intuition may still be helpful there.\n\nEffect of random initialization. Our analysis also shows how proper random initialization helps the optimization and consequently generalization. Essentially, this guarantees that w.h.p. for weights close to the initialization, many hidden ReLU units will have the same activation patterns (i.e., activated or not) as for the initializations, which means the gradients in the neighborhood look like those when the hidden units have fixed activation patterns. This allows SGD to make progress when the loss is large, and eventually learn a good solution. We also note that it is essential to carefully set the scale of the initialization, which is an extensively studied topic [20, 28]. Our initialization has a scale related to the number of hidden units, which is particularly useful when the network size is varying, and thus can be of interest in such practical settings.\n\n[Figure 1: Results on the synthetic data. Panels: (a) Test accuracy vs. number of steps; (b) Coupling: activation pattern difference ratio vs. number of steps; (c) Distance from the initialization: relative distance vs. number of steps; (d) Rank of accumulated updates: singular values of the weight matrix and of the accumulated updates vs. singular value index (y-axis in log-scale). Panels (a)-(c) show one curve per number of hidden nodes: 500, 1000, 2000, 4000, 8000, 16000, 32000.]\n\n7 Experiments\n\nThis section aims at verifying some key implications: (1) the activation patterns of the hidden units couple with those at initialization; (2) the distance of the learned solution from the initialization is relatively small compared to the size of the initialization; (3) the accumulated updates (i.e., the difference between the learned weight matrix and the initialization) have approximately low rank. These are indeed supported by the results on the synthetic and the MNIST data. Additional experiments are presented in Appendix D.\n\nSetup. The synthetic data are of dimension 1000 and consist of $k = 10$ classes, each having $l = 2$ components. Each component is of equal probability $1/(kl)$, and is a Gaussian with covariance $\\sigma^2/d \\cdot I$ and its mean is i.i.d. sampled from a Gaussian distribution $\\mathcal{N}(0, \\sigma_0^2/d)$, where $\\sigma = 1$ and $\\sigma_0 = 5$. 1000 training data points and 1000 test data points are sampled. The network structure and the learning process follow those in Section 3; the number of hidden units $m$ varies in the experiments, and the weights are initialized with $\\mathcal{N}(0, 1/\\sqrt{m})$. On the synthetic data, the SGD is run for $T = 400$ steps with batch size $B = 16$ and learning rate $\\eta = 10/m$. On MNIST, the SGD is run for $T = 2 \\times 10^4$ steps with batch size $B = 64$ and learning rate $\\eta = 4 \\times 10^3/m$.\n\nBesides the test accuracy, we report three quantities corresponding to the three observations/implications to be verified. First, for coupling, we compute the fraction of hidden units whose activation pattern changed compared to the time at initialization. Here, the activation pattern is defined as 1 if the input to the ReLU is positive and 0 otherwise. Second, for distance, we compute the relative ratio $\\|w^{(t)} - w^{(0)}\\|_F / \\|w^{(0)}\\|_F$, where $w^{(t)}$ is the weight matrix at time $t$. Finally, for the rank of the accumulated updates, we plot the singular values of $w^{(T)} - w^{(0)}$ where $T$ is the final step. All experiments
are repeated 5 times, and the mean and standard deviation are reported.\n\n[Figure 2: Results on the MNIST data. Panels: (a) Test accuracy vs. number of steps; (b) Coupling: activation pattern difference ratio vs. number of steps; (c) Distance from the initialization: relative distance vs. number of steps; (d) Rank of accumulated updates: singular values of the weight matrix and of the accumulated updates vs. singular value index (y-axis in log-scale). Panels (a)-(c) show one curve per number of hidden nodes: 1000, 2000, 4000, 8000, 16000, 32000.]\n\nResults. Figure 1 shows the results on the synthetic data. The test accuracy quickly converges to 100%, which is even more significant with a larger number of hidden units, showing that the overparameterization helps the optimization and generalization. Recall that our analysis shows that for a learning rate linearly decreasing with the number of hidden nodes $m$, the number of iterations to achieve a desired accuracy should be roughly the same, which is also verified here. The activation pattern difference ratio is less than 0.1, indicating a strong coupling. The relative distance is less than 0.1, so the final solution is indeed close to the initialization. Finally, the top 20 singular values of the accumulated updates are much larger than the rest while the spectrum of the weight matrix does not have such structure, which is also consistent with our analysis.\n\nFigure 2 shows the results on MNIST. The observations in general are similar to those on the synthetic data (though less significant), and also the observed trends become more evident with more overparameterization. Some additional results (e.g., varying the variance of the synthetic data) are provided in the appendix that also sup
port our theory.\n\n8 Conclusion\n\nThis work studied the problem of learning a two-layer overparameterized ReLU neural network via stochastic gradient descent (SGD) from random initialization, on data with structure inspired by practical datasets. While our work makes a step towards theoretical understanding of SGD for training neural networks, it is far from being conclusive. In particular, the real data could be separable with respect to a different metric than $\\ell_2$, or even a non-convex distance given by some manifold. We view this as an important open direction.\n\nAcknowledgements\n\nWe would like to thank the anonymous reviewers of NIPS\u201918 and Jason Lee for helpful comments. This work was supported in part by FA9550-18-1-0166, NSF grants CCF-1527371, DMS-1317308, Simons Investigator Award, Simons Collaboration Grant, and ONR-N00014-16-1-2329. Yingyu Liang would also like to acknowledge that support for this research was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin Madison with funding from the Wisconsin Alumni Research Foundation.\n\nReferences\n\n[1] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.\n[2] Devansh Arpit, Stanis\u0142aw Jastrz\u0119bski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. arXiv preprint arXiv:1706.05394, 2017.\n[3] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241\u20136250, 2017.\n[4] Cenk Baykal, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. Data-dependent coresets for compressing neural networks with applications to generalization bounds. arXiv preprint arXiv:1804.05345, 2018.\n[5] Digvijay Boob and Guanghui Lan. Theoretical properties of the global optimizer of two layer neural network. arXiv preprint arXiv:1710.11241, 2017.\n[6] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.\n[7] Alon Brutzkus, Amir
Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.\n[8] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.\n[9] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.\n[10] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.\n[11] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6152\u20136160, 2017.\n[12] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.\n[13] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586\u2013594, 2016.\n[14] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.\n[15] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[16] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.\n[17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597\u2013607, 2017.\n[18] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855\u2013863, 2014.\n[19] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matr
ix completion and blind deconvolution. arXiv preprint arXiv:1711.10467, 2017.\n[20] James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735\u2013742, 2010.\n[21] Cisse Moustapha, Bojanowski Piotr, Grave Edouard, Dauphin Yann, and Usunier Nicolas. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.\n[22] Vaishnavh Nagarajan and Zico Kolter. Generalization in deep networks: The role of distance from initialization. NIPS workshop on Deep Learning: Bridging Theory and Practice, 2017.\n[23] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.\n[24] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.\n[25] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.\n[26] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.\n[27] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.\n[28] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139\u20131147, 2013.\n[29] Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.\n[30] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.\n[31] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.\n[32] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recover
y guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.\n[33] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Compressibility and generalization in large-scale deep learning. arXiv preprint arXiv:1804.05862, 2018.", "award": [], "sourceid": 4999, "authors": [{"given_name": "Yuanzhi", "family_name": "Li", "institution": "Princeton"}, {"given_name": "Yingyu", "family_name": "Liang", "institution": "University of Wisconsin Madison"}]}