{"title": "Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel", "book": "Advances in Neural Information Processing Systems", "page_first": 9712, "page_last": 9724, "abstract": "Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard $\\ell_2$ regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in $d$ dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $\\Omega(d^2)$ samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new technique for proving lower bounds for kernel methods, which relies on showing that the kernel cannot focus on informative features. Motivated by our generalization results, we study whether the regularized global optimum is attainable. We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.", "full_text": "RegularizationMatters:GeneralizationandOptimizationofNeuralNetsv.s.theirInducedKernelColinWeiDepartmentofComputerScienceStanfordUniversitycolinwei@stanford.eduJasonD.LeeDepartmentofElectricalEngineeringPrincetonUniversityjasonlee@princeton.eduQiangLiuDepartmentofComputerScienceUniversityofTexasatAustinlqiang@cs.texas.eduTengyuMaDepartmentofComputerScienceStanfordUniversitytengyuma@stanford.eduAbstractRecentworkshaveshownthatonsuf\ufb01cientlyover-parametrizedneuralnets,gra-dientdescentwithrelativelylargeinitializationoptimizesapredictionfunctionintheRKHSoftheNeuralTangentKernel(NTK).Thisanalysisleadstoglobalconvergenceresultsbutdoesnotworkwhenthereisastandard\u20182regularizer,whichisusefultohaveinpractice.Weshowthatsampleef\ufb01ciencycanindeeddependonthepresenceoftheregularizer:weconstructasimpledistributioninddimensionswhichtheoptimalregularizedneuralnetlearnswithO(d)samplesbuttheNTKrequires\u2126(d2)samplestolearn.Toprovethis,weestablishtwoanalysistools:i)formulti-layerfeedforwardReLUnets,weshowthattheglobalminimizerofaweakly-regularizedcross-entropylossisthemaxnormalizedmarginsolutionamongallneuralnets,whichgeneralizeswell;ii)wedevelopanewtechniqueforprovinglowerboundsforkernelmethods,whichreliesonshowingthatthekernelcannotfocusoninformativefeatures.Motivatedbyourgeneralizationresults,westudywhethertheregularizedglobaloptimumisattainable.Weprovethatforin\ufb01nite-widthtwo-layernets,noisygradientdescentoptimizestheregularizedneuralnetlosstoaglobalminimuminpolynomialiterations.1IntroductionIndeeplearning,over-parametrizationreferstothewidely-adoptedtechniqueofusingmorepa-rametersthannecessary[35,40].Over-parametrizationiscrucialforsuccessfuloptimization,andalargebodyofworkhasbeendevotedtowardsunderstandingwhy.Onelineofrecentworks[17,37,22,21,2,76,31,6,16,72]offersanexplanationthatinvitesanalogywithker-nelmethods,provingthatwithsuf\ufb01cientover-parameterizationandacertaininitializationscaleandlearningrateschedule,gradientdescentessentiallylearnsalinearclassi\ufb01erontopoftheinitialr
For this same setting, Daniely [17], Du et al. [22, 21], Jacot et al. [31], Arora et al. [6, 5] make this connection explicit by establishing that the prediction function found by gradient descent is in the span of the training data in a reproducing kernel Hilbert space (RKHS) induced by the Neural Tangent Kernel (NTK). The generalization error of the resulting network can be analyzed via the Rademacher complexity of the kernel method. These works provide some of the first algorithmic results for the success of gradient descent in optimizing neural nets; however, the resulting generalization error is only as good as that of fixed kernels [6]. On the other hand, the equivalence of gradient descent and NTK is broken if the loss has an explicit regularizer such as weight decay.

Figure 1: Data points from D have first two coordinates displayed above, with red and blue denoting labels of −1, +1, respectively. The remaining coordinates are uniform in {−1, +1}^{d−2}.

In this paper, we study the effect of an explicit regularizer on neural net generalization via the lens of margin theory. We first construct a simple distribution on which the two-layer network optimizing explicitly regularized logistic loss will achieve a large margin, and therefore, good generalization. On the other hand, any prediction function in the span of the training data in the RKHS induced by the NTK will overfit to noise and therefore achieve poor margin and bad generalization.

Theorem 1.1 (Informal version of Theorem 2.1). Consider the setting of learning the distribution D defined in Figure 1 using a two-layer network with relu activations with the goal of achieving small generalization error. Using o(d²) samples, no function in the span of the training data in the RKHS induced by the NTK can succeed. On the other hand, the global optimizer of the ℓ2-regularized logistic loss can learn D with O(d) samples.

The full result is stated in Section 2. The intuition is that regularization allows the neural net to obtain a better margin than the fixed NTK kernel and thus achieve better generalization. Our sample complexity lower bound for NTK applies to a broad class of losses including standard 0-1 classification loss and squared ℓ2. To the best of our knowledge, the proof techniques for obtaining this bound are novel and of independent interest (see our proof overview in Section 2). In Section 5, we confirm empirically that an explicit regularizer can indeed improve the margin and generalization.

Yehudai and Shamir [73] also prove a lower bound on the learnability of neural net kernels. They show an approximation result that Ω(exp(d)) random relu features are required to fit a single neuron in ℓ2 squared loss, which lower bounds the amount of over-parametrization necessary to approximate a single neuron. In contrast, we prove sample-complexity lower bounds which hold for both classification and ℓ2 loss even with infinite over-parametrization.

Motivated by the provably better generalization of regularized neural nets for our constructed instance, in Section 3 we study their optimization, as the previously cited results only apply when the neural net behaves like a kernel. We show optimization is possible for infinite-width regularized nets.

Theorem 1.2 (Informal, see Theorem 3.3). For infinite-width two-layer networks with ℓ2-regularized loss, noisy gradient descent finds a global optimizer in a polynomial number of iterations.

This improves upon prior works [43, 15, 65, 61] which study optimization in the same infinite-width limit but do not provide polynomial convergence rates. (See more discussions in Section 3.)

To establish Theorem 1.1, we rely on tools from margin theory. In Section 4, we prove a number of results of independent interest regarding the margin of a regularized neural net. We show that the global minimum of weakly-regularized logistic loss of any homogeneous network (regardless of depth or width) achieves the max normalized margin among all networks with the same architecture (Theorem 4.1).
By "weak" regularizer, we mean that the coefficient of the regularizer in the loss is very small (approaching 0). By combining with a result of [25], we conclude that the minimizer enjoys a width-free generalization bound depending on only the inverse normalized margin (normalized by the norm of the weights) and depth (Corollary 4.2). This explains why optimizing the ℓ2-regularized loss typically used in practice can lead to parameters with a large margin and good generalization. We further note that the maximum possible margin is non-decreasing in the width of the architecture, so the generalization bound of Corollary 4.2 improves as the size of the network grows (see Theorem 4.3). Thus, even if the dataset is already separable, it could still be useful to further over-parameterize to achieve better generalization.

Finally, we empirically validate several claims made in this paper in Section 5. First, we confirm on synthetic data that neural networks do generalize better with an explicit regularizer vs. without. Second, we show that for two-layer networks, the test error decreases and margin increases as the hidden layer grows, as predicted by our theory.

1.1 Additional Related Work

Zhang et al. [74] and Neyshabur et al. [52] show that neural network generalization defies conventional explanations and requires new ones. Neyshabur et al. [48] initiate the search for the "inductive bias" of neural networks towards solutions with good generalization. Recent papers [30, 12, 14] study inductive bias through training time and sharpness of local minima. Neyshabur et al. [49] propose a steepest descent algorithm in a geometry invariant to weight rescaling and show this improves generalization. Morcos et al. [45] relate generalization to the number of "directions" in the neurons. Other papers [26, 68, 46, 28, 38, 27, 32] study implicit regularization towards a specific solution. Ma et al. [41] show that implicit regularization helps gradient descent avoid overshooting optima. Rosset et al. [58, 59] study linear logistic regression with weak regularization and show convergence to the max margin. In Section 4, we adopt their techniques and extend their results.

A line of work initiated by Neyshabur et al. [50] has focused on deriving tighter norm-based Rademacher complexity bounds for deep neural networks [9, 51, 25] and new compression based generalization properties [4]. Bartlett et al. [9] highlight the important role of normalized margin in neural net generalization. Wei and Ma [70] prove generalization bounds depending on additional data-dependent properties. Dziugaite and Roy [23] compute non-vacuous generalization bounds from PAC-Bayes bounds. Neyshabur et al. [53] investigate the Rademacher complexity of two-layer networks and propose a bound that is decreasing with the distance to initialization. Liang and Rakhlin [39] and Belkin et al. [10] study the generalization of kernel methods.

For optimization, Soudry and Carmon [67] explain why over-parametrization can remove bad local minima. Safran and Shamir [63] show over-parametrization can improve the quality of a random initialization. Haeffele and Vidal [29], Nguyen and Hein [55], and Venturi et al. [69] show that for sufficiently overparametrized networks, all local minima are global, but do not show how to find these minima via gradient descent. Du and Lee [19] show for two-layer networks with quadratic activations, all second-order stationary points are global minimizers. Arora et al. [3] interpret over-parametrization as a means of acceleration. Mei et al. [43], Chizat and Bach [15], Sirignano and Spiliopoulos [65], Dou and Liang [18], Mei et al. [44] analyze a distributional view of over-parametrized networks. Chizat and Bach [15] show that Wasserstein gradient flow converges to global optimizers under structural assumptions. We extend this to a polynomial-time result.

Finally, many papers have shown convergence of gradient descent on neural nets [2, 1, 37, 22, 21, 6, 76, 13, 31, 16] using analyses which prove the weights do not move far from initialization.
These analyses do not apply to the regularized loss, and our experiments in Section F suggest that moving away from the initialization is important for better test performance.

Another line of work takes a Bayesian perspective on neural nets. Under an appropriate choice of prior, they show an equivalence between the random neural net and Gaussian processes in the limit of infinite width or channels [47, 71, 36, 42, 24, 56]. This provides another kernel perspective of neural nets. Yehudai and Shamir [73], Chizat and Bach [16] also argue that the kernel perspective of neural nets is not sufficient for understanding the success of deep learning. Chizat and Bach [16] argue that the kernel perspective of gradient descent is caused by a large initialization and does not necessarily explain the empirical successes of over-parametrization. Yehudai and Shamir [73] prove that Ω(exp(d)) random relu features cannot approximate a single neuron in squared error loss. In comparison, our lower bounds are for the sample complexity rather than width of the NTK prediction function and apply even with infinite over-parametrization for both classification and squared loss.

1.2 Notation

Let R denote the set of real numbers. We will use ‖·‖ to indicate a general norm, with ‖·‖_2 denoting the ℓ2 norm and ‖·‖_F the Frobenius norm. We use a bar on top of a symbol to denote a unit vector: when applicable, ū ≜ u/‖u‖, with the norm ‖·‖ clear from context. Let N(0, σ²) denote the normal distribution with mean 0 and variance σ². For vectors u_1 ∈ R^{d_1}, u_2 ∈ R^{d_2}, we use the notation (u_1, u_2) ∈ R^{d_1+d_2} to denote their concatenation. We also say a function f is a-homogeneous in input x if f(cx) = c^a f(x) for any c, and we say f is a-positive-homogeneous if there is the additional constraint c > 0. We reserve the symbol X = [x_1, ..., x_n] to denote the collection of data points (as a matrix), and Y = [y_1, ..., y_n] to denote labels. We use d to denote the dimension of our data.

We will use the notations a ≲ b, a ≳ b to denote less than or greater than up to a universal constant, respectively, and when used in a condition, to denote the existence of such a constant such that the condition is true. Unless stated otherwise, O(·), Ω(·) denotes some universal constant in upper and lower bounds. The notation poly denotes a universal constant-degree polynomial in the arguments.

2 Generalization of Regularized Neural Net vs. NTK Kernel

We will compare neural net solutions found via regularization and methods involving the NTK and construct a data distribution D in d dimensions which the neural net optimizer of regularized logistic loss learns with sample complexity O(d). The kernel method will require Ω(d²) samples to learn.

We start by describing the distribution D of examples (x, y). Here e_i is the i-th standard basis vector and we use x^⊤e_i to represent the i-th coordinate of x (since the subscript is reserved to index training examples). First, for any k ≥ 3, x^⊤e_k ∼ {−1, +1} is a uniform random bit, and for x^⊤e_1, x^⊤e_2 and y, choose

    y = +1, x^⊤e_1 = +1, x^⊤e_2 = 0    with prob. 1/4
    y = +1, x^⊤e_1 = −1, x^⊤e_2 = 0    with prob. 1/4
    y = −1, x^⊤e_1 = 0, x^⊤e_2 = +1    with prob. 1/4
    y = −1, x^⊤e_1 = 0, x^⊤e_2 = −1    with prob. 1/4      (2.1)

The distribution D contains all of its signal in the first 2 coordinates, and the remaining d − 2 coordinates are noise. We visualize its first 2 coordinates in Figure 1.

Next, we formally define the two-layer neural net with relu activations and its associated NTK. We parameterize a two-layer network with m units by last layer weights w_1, ..., w_m ∈ R and weight vectors u_1, ..., u_m ∈ R^d. We denote by Θ the collection of parameters and by θ_j the unit-j parameters (u_j, w_j). The network computes f_NN(x; Θ) ≜ Σ_{j=1}^m w_j [u_j^⊤ x]_+, where [·]_+ denotes the relu activation. For binary labels y_1, ..., y_n ∈ {−1, +1}, the ℓ2 regularized logistic loss is

    L_λ(Θ) ≜ (1/n) Σ_{i=1}^n log(1 + exp(−y_i f_NN(x_i; Θ))) + λ ‖Θ‖_F²      (2.2)

Let Θ_λ ∈ argmin_Θ L_λ(Θ) be its global optimizer.
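To make the setup concrete, the following minimal numpy sketch (ours, not from the paper; the width m = 4 and all function names are illustrative choices) draws samples from D in equation 2.1 and evaluates the ℓ2-regularized logistic loss of equation 2.2 for a two-layer relu net.

    import numpy as np

    def sample_D(n, d, rng):
        # Draw n examples from the distribution D of equation 2.1.
        x = rng.choice([-1.0, 1.0], size=(n, d))   # coordinates 3..d are uniform random bits
        y = rng.choice([-1.0, 1.0], size=n)        # labels are balanced
        s = rng.choice([-1.0, 1.0], size=n)        # which of the two points of the class
        x[:, 0] = np.where(y == 1, s, 0.0)         # y = +1: x^T e_1 = +/-1, x^T e_2 = 0
        x[:, 1] = np.where(y == -1, s, 0.0)        # y = -1: x^T e_1 = 0,    x^T e_2 = +/-1
        return x, y

    def f_nn(x, w, U):
        # Two-layer relu net f_NN(x; Theta) = sum_j w_j [u_j^T x]_+ .
        return np.maximum(x @ U.T, 0.0) @ w

    def regularized_logistic_loss(x, y, w, U, lam):
        # Equation 2.2: average logistic loss plus lambda * ||Theta||_F^2.
        margins = y * f_nn(x, w, U)
        return np.mean(np.logaddexp(0.0, -margins)) + lam * (np.sum(w**2) + np.sum(U**2))

    rng = np.random.default_rng(0)
    d, m = 20, 4
    x, y = sample_D(200, d, rng)
    w, U = rng.normal(size=m), rng.normal(size=(m, d))
    print(regularized_logistic_loss(x, y, w, U, lam=1e-4))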
Define the NTK kernel associated with the architecture (with random weights):

    K(x′, x) = E_{w∼N(0, r_w²), u∼N(0, r_u² I)} [⟨∇_θ f_NN(x; Θ), ∇_θ f_NN(x′; Θ)⟩]

where ∇_θ f_NN(x; Θ) = (w 1{x^⊤u ≥ 0} x, [x^⊤u]_+) is the gradient of the network output with respect to a generic hidden unit, and r_w, r_u are relative scaling parameters. Note that the typical NTK is realized specifically with scales r_w = r_u = 1, but our bound applies for all choices of r_w, r_u. For coefficients β, we can then define the prediction function f_kernel(x; β) in the RKHS induced by K as f_kernel(x; β) ≜ Σ_{i=1}^n β_i K(x_i, x). For example, such a classifier would be attained by running gradient descent on squared loss for a wide network using the appropriate random initialization (see [31, 22, 21, 6]).

We now present our comparison theorem below and fill in its proof in Section B.

Theorem 2.1. Let D be the distribution defined in equation 2.1. With probability 1 − d^{−5} over the random draw of n ≲ d² samples (x_1, y_1), ..., (x_n, y_n) from D, for all choices of β, the kernel prediction function f_kernel(·; β) will have at least Ω(1) error:

    Pr_{(x,y)∼D}[f_kernel(x; β) y ≤ 0] = Ω(1)

Meanwhile, for λ ≤ poly(n)^{−1}, the regularized neural net solution f_NN(·; Θ_λ) with at least 4 hidden units can have good generalization with O(d²) samples because we have the following generalization error bound:

    Pr_{(x,y)∼D}[f_NN(x; Θ_λ) y ≤ 0] ≲ √(d/n)

This implies an Ω(d) sample-complexity gap between the regularized neural net and kernel prediction function.

While the above theorem is stated for classification, the same D can be used to straightforwardly prove an Ω(d) sample complexity gap for the truncated squared loss ℓ(ŷ; y) = min((y − ŷ)², 1).¹ We provide more details in Section B.3. Our intuition for this gap is that the regularization allows the neural net to find informative features (weight vectors) that are adaptive to the data distribution and easier for the last layer's weights to separate. For example, the neurons [e_1^⊤x]_+, [−e_1^⊤x]_+, [e_2^⊤x]_+, [−e_2^⊤x]_+ are enough to fit our particular distribution. In comparison, the NTK method is unable to change the feature space and is only searching for the coefficients in the kernel space.

¹ The truncation is required to prove generalization of the regularized neural net using standard tools.

Proof techniques for the upper bound: For the upper bound, neural nets with small Euclidean norm will be able to separate D with large margin (a two-layer net with width 4 can already achieve a large margin). As we show in Section 4, a solution with a max neural-net margin is attained by the global optimizer of the regularized logistic loss; in fact, we show this holds generally for homogeneous networks of any depth and width (Theorem 4.1). Then, by the classical connection between margin and generalization [34], this optimizer will generalize well.

Proof techniques for the lower bound: On the other hand, the NTK will have a worse margin when fitting samples from D than the regularized neural networks because NTK operates in a fixed kernel space.² However, proving that the NTK has a small margin does not suffice because the generalization error bounds which depend on margin may not be tight. We develop a new technique to prove lower bounds for kernel methods, which we believe is of independent interest, as there are few prior works that prove lower bounds for kernel methods. (One that does is [54], but their results require constructing an artificial kernel and data distribution, whereas our lower bounds are for a fixed kernel.) The main intuition is that because NTK uses infinitely many random features, it is difficult for the NTK to focus on a small number of informative features: doing so would require a very high RKHS norm. In fact, we show that with a limited number of examples, any function in the span of the training examples must heavily use random features rather than informative features. The random features can collectively fit the training data, but will give worse generalization.

² There could be some variations of the NTK space depending on the scales of the initialization of the two layers, but our Theorem 2.1 shows that these variations also suffer from a worse sample complexity.
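Since the NTK above is an expectation over random weights, one way to get a feel for it is a Monte Carlo estimate with finitely many random units. The sketch below (ours; sample_D is the helper from the previous sketch, and m, the small ridge 1e-6, and all names are illustrative assumptions) builds per-unit gradient features, estimates K from them, and fits β by kernel regression, so the resulting predictor lies in the span of the training examples exactly as in the lower bound; note the features are fixed, whereas the regularized net can move its weight vectors toward e_1 and e_2.

    import numpy as np

    def ntk_features(x, w0, U0):
        # Per-unit gradient features; their inner products give a Monte Carlo estimate of
        # K(x', x) = E[<grad_theta f_NN(x; Theta), grad_theta f_NN(x'; Theta)>] with m sampled units.
        pre = x @ U0.T
        ind = (pre >= 0).astype(float)
        grad_u = ind[:, :, None] * w0[None, :, None] * x[:, None, :]   # w * 1{x^T u >= 0} * x, shape (n, m, d)
        grad_w = np.maximum(pre, 0.0)                                   # [x^T u]_+, shape (n, m)
        feats = np.concatenate([grad_u.reshape(len(x), -1), grad_w], axis=1)
        return feats / np.sqrt(len(w0))                                 # inner products then average over the m units

    rng = np.random.default_rng(0)
    d, m = 20, 1000
    w0 = rng.normal(0.0, 1.0, size=m)                                   # r_w = 1
    U0 = rng.normal(0.0, 1.0, size=(m, d))                              # r_u = 1
    x_tr, y_tr = sample_D(200, d, rng)
    F_tr = ntk_features(x_tr, w0, U0)
    beta = np.linalg.solve(F_tr @ F_tr.T + 1e-6 * np.eye(len(x_tr)), y_tr)  # fit coefficients on the training set

    def f_kernel(x_new):
        # f_kernel(x; beta) = sum_i beta_i K(x_i, x), with K estimated from the m units.
        return ntk_features(x_new, w0, U0) @ F_tr.T @ beta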
3 Perturbed Wasserstein Gradient Flow Finds Global Optimizers in Polynomial Time

In the prior section, we argued that a neural net with ℓ2 regularization can achieve much better generalization than the NTK. Our result required attaining the global minimum of the regularized loss; however, existing optimization theory only allows for such convergence to a global minimizer with a large initialization and no regularizer. Unfortunately, these are the regimes where the neural net learns a kernel prediction function [31, 22, 6]. In this section, we show that at least for infinite-width two-layer nets, optimization is not an issue: noisy gradient descent finds global optimizers of the ℓ2 regularized loss in polynomial iterations.

Prior work [43, 15] has shown that as the hidden layer size grows to infinity, gradient descent for a finite neural network approaches the Wasserstein gradient flow over distributions of hidden units (defined in equation 3.1). With the assumption that the gradient flow converges, which is non-trivial since the space of distributions is infinite-dimensional, Chizat and Bach [15] prove that Wasserstein gradient flow converges to a global optimizer but do not specify a rate. Mei et al. [43] add an entropy regularizer to form an objective that is the infinite-neuron limit of stochastic Langevin dynamics. They show global convergence but also do not provide explicit rates. In the worst case, their convergence can be exponential in dimension. In contrast, we provide explicit polynomial convergence rates for a slightly different algorithm, perturbed Wasserstein gradient flow.

Infinite-width neural nets are modeled mathematically as a distribution over weights: formally, we optimize the following functional over distributions ρ on R^{d+1}:

    L[ρ] ≜ R(∫Φ dρ) + ∫V dρ

where Φ: R^{d+1} → R^k, R: R^k → R, and V: R^{d+1} → R. R and V can be thought of as the loss and regularizer, respectively. In this work, we consider 2-homogeneous Φ and V. We will additionally require that R is convex and nonnegative and V is positive on the unit sphere. Finally, we need standard regularity assumptions on R, Φ, and V:

Assumption 3.1 (Regularity conditions on Φ, R, V). Φ and V are differentiable as well as upper bounded and Lipschitz on the unit sphere. R is Lipschitz and its Hessian has bounded operator norm.

We provide more details on the specific parameters (for boundedness, Lipschitzness, etc.) in Section E.1. We note that relu networks satisfy every condition but differentiability of Φ.³ We can fit an ℓ2 regularized neural network under our framework:

Example 3.2 (Logistic loss for neural networks). We interpret ρ as a distribution over the parameters of the network. Let k ≜ n and Φ_i(θ) ≜ w φ(u^⊤ x_i) for θ = (w, u). In this case, ∫Φ dρ is a distributional neural network that computes an output for each of the n training examples (like a standard neural network, it also computes a weighted sum over hidden units). We can compute the distributional version of the regularized logistic loss in equation 2.2 by setting V(θ) ≜ λ‖θ‖_2² and R(a_1, ..., a_n) ≜ Σ_{i=1}^n log(1 + exp(−y_i a_i)).

³ The relu activation is non-differentiable at 0 and hence the gradient flow is not well-defined. Chizat and Bach [15] acknowledge this same difficulty with relu.
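For intuition, here is a small numpy sketch (ours; names are illustrative and φ is taken to be relu) of Example 3.2 when ρ is approximated by m equally weighted particles θ_j = (w_j, u_j), so that ∫Φ dρ becomes an average over particles and ∫V dρ an average of λ‖θ_j‖².

    import numpy as np

    def loss_functional(theta, X, y, lam):
        # L[rho] = R(integral Phi d rho) + integral V d rho for a finite particle cloud theta of shape (m, d+1).
        w, U = theta[:, 0], theta[:, 1:]
        a = np.mean(w[:, None] * np.maximum(U @ X.T, 0.0), axis=0)   # integral Phi d rho: one output per example
        R = np.sum(np.logaddexp(0.0, -y * a))                        # R(a) = sum_i log(1 + exp(-y_i a_i))
        V = lam * np.mean(np.sum(theta**2, axis=1))                  # integral V d rho with V(theta) = lam ||theta||_2^2
        return R + V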
We will define L′[ρ]: R^{d+1} → R with L′[ρ](θ) ≜ ⟨R′(∫Φ dρ), Φ(θ)⟩ + V(θ) and v[ρ](θ) ≜ −∇_θ L′[ρ](θ). Informally, L′[ρ] is the gradient of L with respect to ρ, and v is the induced velocity field. For the standard Wasserstein gradient flow dynamics, ρ_t evolves according to

    d/dt ρ_t = −∇·(v[ρ_t] ρ_t)      (3.1)

where ∇· denotes the divergence of a vector field. For neural networks, these dynamics formally define continuous-time gradient descent when the hidden layer has infinite size (see Theorem 2.6 of [15], for instance). More generally, equation 3.1 is due to the formula for Wasserstein gradient flow dynamics (see for example [64]), which are derived via continuous-time steepest descent with respect to Wasserstein distance over the space of probability distributions on the neurons. We propose the following modified dynamics:

    d/dt ρ_t = −σρ_t + σU_d − ∇·(v[ρ_t] ρ_t)      (3.2)

where U_d is the uniform distribution on S^d. In our perturbed dynamics, we add very small uniform noise over U_d, which ensures that at all time-steps, there is sufficient mass in a descent direction for the algorithm to decrease the objective. For infinite-size neural networks, one can informally interpret this as re-initializing a very small fraction of the neurons at every step of gradient descent. We prove convergence to a global optimizer in time polynomial in 1/ε, d, and the regularity parameters.

Theorem 3.3 (Theorem E.4 with regularity parameters omitted). Suppose that Φ and V are 2-homogeneous and the regularity conditions of Assumption 3.1 are satisfied. Also assume that from starting distribution ρ_0, a solution to the dynamics in equation 3.2 exists. Define L⋆ ≜ inf_ρ L[ρ]. Let ε > 0 be a desired error threshold and choose σ ≜ exp(−d log(1/ε) poly(k, L[ρ_0] − L⋆)) and t_ε ≜ (d²/ε⁴) poly(log(1/ε), k, L[ρ_0] − L⋆), where the regularity parameters for Φ, V, and R are hidden in the poly(·). Then, perturbed Wasserstein gradient flow converges to an ε-approximate global minimum in t_ε time:

    min_{0 ≤ t ≤ t_ε} L[ρ_t] − L⋆ ≤ ε

We state and prove a version of Theorem 3.3 that includes regularity parameters in Sections E.1 and E.3. The key idea for the proof is as follows: as R is convex, the optimization problem will be convex over the space of distributions ρ. This convexity allows us to argue that if ρ is suboptimal, there either exists a descent direction θ̄ ∈ S^d where L′[ρ](θ̄) ≪ 0, or the gradient flow dynamics will result in a large decrease in the objective. If such a direction θ̄ exists, the uniform noise σU_d along with the 2-homogeneity of Φ and V will allow the optimization dynamics to increase the mass in this direction exponentially fast, which causes a polynomial decrease in the loss.

As a technical detail, Theorem 3.3 requires that a solution to the dynamics exists. We can remove this assumption by analyzing a discrete-time version of equation 3.2:

    ρ_{t+1} ≜ ρ_t + η(−σρ_t + σU_d − ∇·(v[ρ_t] ρ_t))

and additionally assuming Φ and V have Lipschitz gradients. In this setting, a polynomial time convergence result also holds. We state the result in Section E.4.

An implication of our Theorem 3.3 is that for infinite networks, we can optimize the weakly-regularized logistic loss in time polynomial in the problem parameters and λ^{−1}. In Theorem 2.1 we only require λ^{−1} = poly(n); thus, an infinite width neural net can learn the distribution D up to error Õ(√(d/n)) in polynomial time using noisy gradient descent.
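The measure-valued dynamics in equation 3.2 are not something one runs directly; the sketch below (ours) implements only the informal finite-neuron reading given above, namely a gradient step on every particle followed by re-initializing an expected σ-fraction of particles uniformly on the sphere. The step size, σ, and grad_fn (the gradient of a finite-particle objective such as loss_functional above) are illustrative assumptions, not the paper's algorithm.

    import numpy as np

    def perturbed_gradient_step(theta, grad_fn, eta, sigma, rng):
        # One discrete step: transport particles along the velocity field, then
        # replace a small random subset with fresh particles drawn uniformly on S^d.
        theta = theta - eta * grad_fn(theta)
        k = rng.binomial(len(theta), sigma)              # expected sigma-fraction of the neurons
        if k > 0:
            idx = rng.choice(len(theta), size=k, replace=False)
            fresh = rng.normal(size=(k, theta.shape[1]))
            theta[idx] = fresh / np.linalg.norm(fresh, axis=1, keepdims=True)
        return theta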
4 Weak Regularizer Guarantees Max Margin Solutions

In this section, we collect a number of results regarding the margin of a regularized neural net. These results provide the tools for proving generalization of the weakly-regularized NN solution in Theorem 2.1. The key technique is showing that with small regularizer λ → 0, the global optimizer of regularized logistic loss will obtain a maximum margin. It is well-understood that a large neural net margin implies good generalization performance [9]. In fact, our result applies to a function class much broader than two-layer relu nets: in Theorem 4.1 we show that when we add a weak regularizer to cross-entropy loss with any positive-homogeneous prediction function, the normalized margin of the optimum converges to the max margin. For example, Theorem 4.1 applies to feedforward relu networks of arbitrary depth and width. In Theorem C.2, we bound the approximation error in the maximum margin when we only obtain an approximate optimizer of the regularized loss. In Corollary 4.2, we leverage these results and pre-existing Rademacher complexity bounds to conclude that the optimizer of the weakly-regularized logistic loss will have width-free generalization bounds scaling with the inverse of the max margin and network depth. Finally, we note that the maximum possible margin can only increase with the width of the network, which suggests that increasing width can improve generalization of the solution (see Theorem 4.3).

We work with a family F of prediction functions f(·; Θ): R^d → R that are a-positive-homogeneous in their parameters for some a > 0: f(x; cΘ) = c^a f(x; Θ), ∀c > 0. We additionally require that f is continuous when viewed as a function in Θ. For some general norm ‖·‖ and λ > 0, we study the λ-regularized logistic loss L_λ, defined as

    L_λ(Θ) ≜ (1/n) Σ_{i=1}^n log(1 + exp(−y_i f(x_i; Θ))) + λ‖Θ‖^r      (4.1)

for fixed r > 0. Let Θ_λ ∈ argmin L_λ(Θ).⁴ Define the normalized margin γ_λ and max-margin γ⋆ by γ_λ ≜ min_i y_i f(x_i; Θ̄_λ) and γ⋆ ≜ max_{‖Θ‖ ≤ 1} min_i y_i f(x_i; Θ). Let Θ⋆ achieve this maximum. We show that with sufficiently small regularization level λ, the normalized margin γ_λ approaches the maximum margin γ⋆. Our theorem and proof are inspired by the result of Rosset et al. [58, 59], who analyze the special case when f is a linear function. In contrast, our result can be applied to non-linear f as long as f is homogeneous.

⁴ We formally show that L_λ has a minimizer in Claim C.3 of Section C.

Theorem 4.1. Assume the training data is separable by a network f(·; Θ⋆) ∈ F with an optimal normalized margin γ⋆ > 0. Then, the normalized margin of the global optimum of the weakly-regularized objective (equation 4.1) converges to γ⋆ as the regularization goes to zero. Mathematically,

    γ_λ → γ⋆ as λ → 0

An intuitive explanation for our result is as follows: because of the homogeneity, the loss L(Θ_λ) roughly satisfies the following (for small λ, and ignoring parameters such as n):

    L_λ(Θ_λ) ≈ exp(−‖Θ_λ‖^a γ_λ) + λ‖Θ_λ‖^r

Thus, the loss selects parameters with larger margin, while the regularization favors smaller norms. The full proof of the theorem is deferred to Section C. Though the result in this section is stated for binary classification, it extends to the multi-class setting with cross-entropy loss. We provide formal definitions and results in Section C. In Theorem C.2, we also show that an approximate minimizer of L_λ can obtain margin that approximates γ⋆.

Although we consider an explicit regularizer, our result is related to recent works on algorithmic regularization of gradient descent for the unregularized objective. Recent works show that gradient descent finds the minimum norm or max-margin solution for problems including logistic regression, linearized neural networks, and matrix factorization [68, 28, 38, 27, 32]. Many of these proofs require a delicate analysis of the algorithm's dynamics, and some are not fully rigorous due to assumptions on the iterates. To the best of our knowledge, it is an open question to prove analogous results for even two-layer relu networks. In contrast, by adding the explicit ℓ2 regularizer to our objective, we can prove broader results that apply to multi-layer relu networks. In the following section we leverage our result and existing generalization bounds [25] to help justify how over-parameterization can improve generalization.
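Theorem 4.1 is easy to probe numerically. The following sketch (ours) does this for the simplest member of the family F, a linear (1-positive-homogeneous) predictor f(x; θ) = θ^⊤x trained by gradient descent on equation 4.1 with r = 2: as λ shrinks, the printed normalized margin should climb toward the max margin γ⋆. The data construction and hyperparameters are arbitrary illustrative choices.

    import numpy as np

    def normalized_margin(theta, X, y):
        # gamma = min_i y_i f(x_i; theta / ||theta||) for the linear model f(x; theta) = <theta, x>.
        return np.min(y * (X @ theta)) / np.linalg.norm(theta)

    rng = np.random.default_rng(0)
    y = rng.choice([-1.0, 1.0], size=50)
    X = y[:, None] * np.ones(5) + 0.3 * rng.normal(size=(50, 5))   # linearly separable data
    for lam in (1e-1, 1e-2, 1e-3, 1e-4):
        theta = np.zeros(5)
        for _ in range(20000):
            p = np.exp(-np.logaddexp(0.0, y * (X @ theta)))         # sigmoid(-y_i <theta, x_i>)
            grad = -(X * (y * p)[:, None]).mean(axis=0) + 2 * lam * theta
            theta -= 0.5 * grad
        print(lam, normalized_margin(theta, X, y))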
4.1 Generalization of the Max-Margin Neural Net

We consider depth-q networks with 1-Lipschitz, 1-positive-homogeneous activation φ for q ≥ 2. Note that the network function is q-positive-homogeneous. Suppose that the collection of parameters Θ is given by matrices W_1, ..., W_q. For simplicity we work in the binary class setting, so the q-layer network computes a real-valued score

    f_NN(x; Θ) ≜ W_q φ(W_{q−1} φ(··· φ(W_1 x) ···))      (4.2)

where we overload notation to let φ(·) denote the element-wise application of the activation φ. Let m_i denote the size of the i-th hidden layer, so W_1 ∈ R^{m_1 × d}, W_2 ∈ R^{m_2 × m_1}, ..., W_q ∈ R^{1 × m_{q−1}}. We will let M ≜ (m_1, ..., m_{q−1}) denote the sequence of hidden layer sizes. We will focus on ℓ2-regularized logistic loss (see equation 4.1, using ‖·‖_F and r = 2) and denote it by L_{λ,M}. Following notation established in this section, we denote the optimizer of L_{λ,M} by Θ_{λ,M}, the normalized margin of Θ_{λ,M} by γ_{λ,M}, the max-margin solution by Θ_{⋆,M}, and the max-margin by γ_{⋆,M}, assumed to be positive. Our notation emphasizes the architecture of the network. We can define the population 0-1 loss of the network parameterized by Θ by L(Θ) ≜ Pr_{(x,y)∼p_data}[y f_NN(x; Θ) ≤ 0]. We let X denote the data domain and C ≜ sup_{x∈X} ‖x‖_2 denote the largest possible norm of a single datapoint.

By combining the neural net complexity bounds of Golowich et al. [25] with our Theorem 4.1, we can conclude that optimizing weakly-regularized logistic loss gives generalization bounds that depend on the maximum possible network margin for the given architecture.

Corollary 4.2. Suppose φ is 1-Lipschitz and 1-positive-homogeneous. With probability at least 1 − δ over the draw of (x_1, y_1), ..., (x_n, y_n) i.i.d. from p_data, we can bound the test error of the optimizer of the regularized loss by

    lim sup_{λ→0} L(Θ_{λ,M}) ≲ (C/γ_{⋆,M}) · q^{(q−1)/2}/√n + ε(γ_{⋆,M})      (4.3)

where ε(γ) ≜ √(log log_2(4C/γ)/n) + √(log(1/δ)/n).

Note that ε(γ_{⋆,M}) is primarily a smaller order term, so the bound mainly scales with (C/γ_{⋆,M}) · q^{(q−1)/2}/√n.⁵

⁵ Although the 1/q^{(q−1)/2} factor of equation D.1 decreases with depth q, the margin γ will also tend to decrease as the constraint ‖Θ̄‖_F ≤ 1 becomes more stringent.

Finally, we observe that the maximum normalized margin is non-decreasing with the size of the architecture. Formally, for two depth-q architectures M = (m_1, ..., m_{q−1}) and M′ = (m′_1, ..., m′_{q−1}), we say M ≤ M′ if m_i ≤ m′_i ∀i = 1, ..., q−1. Theorem 4.3 states if M ≤ M′, the max-margin over networks with architecture M′ is at least the max-margin over networks with architecture M.

Theorem 4.3. Recall that γ_{⋆,M} denotes the maximum normalized margin of a network with architecture M. If M ≤ M′, we have

    γ_{⋆,M} ≤ γ_{⋆,M′}

As an important consequence, the generalization error bound of Corollary 4.2 for M′ is at least as good as that for M.

This theorem is simple to prove and follows because we can directly implement any network of architecture M using one of architecture M′, if M ≤ M′. This highlights one of the benefits of over-parametrization: the margin does not decrease with a larger network size, and therefore Corollary 4.2 gives a better generalization bound. In Section F, we provide empirical evidence that the test error decreases with larger network size while the margin is non-decreasing.

The phenomenon in Theorem 4.3 contrasts with standard ℓ2-normalized linear prediction. In this setting, adding more features increases the norm of the data, and therefore the generalization error bounds could also increase. On the other hand, Theorem 4.3 shows that adding more neurons (which can be viewed as learned features) can only improve the generalization of the max-margin solution.
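As a sanity check on the quantities in Corollary 4.2, the sketch below (ours) evaluates the leading term (C/γ)·q^{(q−1)/2}/√n for a given depth-q relu net that separates the training data, using the empirical normalized margin of that net in place of the unknown max margin γ_{⋆,M} (a lower bound on it, so the term is only overestimated); constants hidden by ≲ are ignored.

    import numpy as np

    def forward(X, Ws):
        # Depth-q relu network of equation 4.2; Ws = [W_1, ..., W_q], with W_q of shape (1, m_{q-1}).
        h = X
        for W in Ws[:-1]:
            h = np.maximum(h @ W.T, 0.0)
        return (h @ Ws[-1].T).ravel()

    def corollary_leading_term(Ws, X, y):
        q = len(Ws)
        norm = np.sqrt(sum(np.sum(W**2) for W in Ws))                 # ||Theta||_F
        gamma = np.min(y * forward(X, Ws)) / norm**q                  # normalized margin of the q-homogeneous net
        C = np.max(np.linalg.norm(X, axis=1))
        return (C / gamma) * q ** ((q - 1) / 2) / np.sqrt(len(X))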
5 Simulations

Figure 2: Comparing regularization and no regularization starting from the same initialization. Left: Normalized margin. Center: Test accuracy. Right: Percentage of activation patterns changed.

We empirically validate our theory with several simulations. First, we train a two-layer net on synthetic data with and without explicit regularization starting from the same initialization in order to demonstrate the effect of an explicit regularizer on generalization. We confirm that the regularized network does indeed generalize better and moves further from its initialization. For this experiment, we use a large initialization scale, so every weight ∼ N(0, 1). We average this experiment over 20 trials and plot the test accuracy, normalized margin, and percentage change in activation patterns in Figure 2. We compute the percentage of activation patterns changed over every possible pair of hidden unit and training example. Since a low percentage of activations change when λ = 0, the unregularized neural net learns in the kernel regime. Our simulations demonstrate that an explicit regularizer improves generalization error as well as the margin, as predicted by our theory.

The data comes from a ground truth network with 10 hidden units, input dimension 20, and a ground truth unnormalized margin of at least 0.01. We use a training set of size 200 and train for 20000 steps with learning rate 0.1, once using regularizer λ = 5×10^{−4} and once using regularization λ = 0. We note that the training error hits 0 extremely quickly (within 50 training iterations). The initial normalized margin is negative because the training error has not yet hit zero.

We also compare the generalization of a regularized neural net and kernel method as the sample size increases. Furthermore, we demonstrate that for two-layer nets, the test error decreases and margin increases as the width of the hidden layer grows, as predicted by our theory. We provide figures and full details in Section F.
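For reference, a stripped-down version of this experiment can be written in a few lines of numpy (ours; the teacher construction is a simplification and the width m = 50 is an illustrative choice, while the learning rate, step count, N(0, 1) initialization, and the two values of λ follow the text).

    import numpy as np

    def train(X, y, lam, steps=20000, lr=0.1, m=50, seed=0):
        # Gradient descent on the loss of equation 2.2 from a large N(0, 1) initialization.
        rng = np.random.default_rng(seed)
        w, U = rng.normal(size=m), rng.normal(size=(m, X.shape[1]))
        for _ in range(steps):
            pre = X @ U.T
            act = np.maximum(pre, 0.0)
            g = -y * np.exp(-np.logaddexp(0.0, y * (act @ w)))        # d(logistic loss)/d(output)
            grad_w = act.T @ g / len(X) + 2 * lam * w
            grad_U = ((g[:, None] * (pre > 0)) * w).T @ X / len(X) + 2 * lam * U
            w, U = w - lr * grad_w, U - lr * grad_U
        norm2 = np.sum(w**2) + np.sum(U**2)
        return np.min(y * (np.maximum(X @ U.T, 0.0) @ w)) / norm2     # normalized margin (the net is 2-homogeneous)

    rng = np.random.default_rng(1)
    tw, tU = rng.normal(size=10), rng.normal(size=(10, 20))           # teacher net with 10 hidden units, d = 20
    X = rng.normal(size=(200, 20))
    y = np.sign(np.maximum(X @ tU.T, 0.0) @ tw)
    for lam in (5e-4, 0.0):
        print(lam, train(X, y, lam))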
6 Conclusion

We have shown theoretically and empirically that explicitly ℓ2 regularized neural nets can generalize better than the corresponding kernel method. We also argue that maximizing margin is one of the inductive biases of relu networks obtained from optimizing weakly-regularized cross-entropy loss. To complement these generalization results, we study optimization and prove that it is possible to find a global minimizer of the regularized loss in polynomial time when the network width is infinite. A natural direction for future work is to apply our theory to optimize the margin of finite-sized neural networks.

Acknowledgments

CW acknowledges the support of an NSF Graduate Research Fellowship. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. We also thank Nati Srebro and Suriya Gunasekar for helpful discussions in various stages of this work.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
[3] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
[4] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
[5] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.
[6] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
[7] Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
[8] Keith Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1–58, 1997.
[9] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
[10] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
[11] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems, pages 123–130, 2006.
[12] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
[13] Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384, 2019.
[14] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
[15] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
[16] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
[17] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
[18] Xialiang Dou and Tengyuan Liang. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. arXiv preprint arXiv:1901.07114, 2019.
[19] Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
[20] Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
[21] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
[22] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
[23] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
[24] Adrià Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.
[25] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
[26] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
[27] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
[28] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018.
[29] Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
[30] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
[31] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
[32] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
[33] Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800, 2009.
[34] Vladimir Koltchinskii, Dmitry Panchenko, et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[36] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[37] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8168–8177, 2018.
[38] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47, 2018.
[39] T. Liang and A. Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv e-prints, August 2018.
[40] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
[41] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv preprint arXiv:1711.10467, 2017.
[42] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
[43] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.
[44] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.
[45] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
[46] Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.
[47] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
[48] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
[49] Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.
[50] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
[51] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
[52] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
[53] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
[54] Andrew Y Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM, 2004.
[55] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
[56] Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian convolutional neural networks with many channels are Gaussian processes. arXiv preprint arXiv:1810.05148, 2018.
[57] Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
[58] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941–973, 2004.
[59] Saharon Rosset, Ji Zhu, and Trevor J Hastie. Margin maximizing loss functions. In Advances in Neural Information Processing Systems, pages 1237–1244, 2004.
[60] Saharon Rosset, Grzegorz Swirszcz, Nathan Srebro, and Ji Zhu. ℓ1 regularization in infinite dimensional feature spaces. In International Conference on Computational Learning Theory, pages 544–558. Springer, 2007.
[61] Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
[62] Mark Rudelson, Roman Vershynin, et al. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, 2013.
[63] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
[64] Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
[65] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
[66] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.
[67] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
[68] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1q7n9gAb.
[69] Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
[70] Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. arXiv preprint arXiv:1905.03684, 2019.
[71] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.
[72] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
[73] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687, 2019.
[74] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[75] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 1-norm support vector machines. In Advances in Neural Information Processing Systems, pages 49–56, 2004.
[76] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.", "award": [], "sourceid": 5131, "authors": [{"given_name": "Colin", "family_name": "Wei", "institution": "Stanford University"}, {"given_name": "Jason", "family_name": "Lee", "institution": "Princeton University"}, {"given_name": "Qiang", "family_name": "Liu", "institution": "UT Austin"}, {"given_name": "Tengyu", "family_name": "Ma", "institution": "Stanford University"}]}