{"title": "PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions", "book": "Advances in Neural Information Processing Systems", "page_first": 947, "page_last": 955, "abstract": "We propose a novel approach to reduce the computational cost of evaluation of convolutional neural networks, a factor that has hindered their deployment in low-power devices such as mobile phones. Inspired by the loop perforation technique from source code optimization, we speed up the bottleneck convolutional layers by skipping their evaluation in some of the spatial positions. We propose and analyze several strategies of choosing these positions. We demonstrate that perforation can accelerate modern convolutional networks such as AlexNet and VGG-16 by a factor of 2x - 4x. Additionally, we show that perforation is complementary to the recently proposed acceleration method of Zhang et al.", "full_text": "PerforatedCNNs:AccelerationthroughEliminationofRedundantConvolutionsMichaelFigurnov1,2,AijanIbraimova4,DmitryVetrov1,3,andPushmeetKohli51NationalResearchUniversityHigherSchoolofEconomics2LomonosovMoscowStateUniversity3Yandex4SkolkovoInstituteofScienceandTechnology5MicrosoftResearchmichael@figurnov.ru,aijan.ibraimova@gmail.com,vetrovd@yandex.ru,pkohli@microsoft.comAbstractWeproposeanovelapproachtoreducethecomputationalcostofevaluationofconvolutionalneuralnetworks,afactorthathashinderedtheirdeploymentinlow-powerdevicessuchasmobilephones.Inspiredbytheloopperforationtechniquefromsourcecodeoptimization,wespeedupthebottleneckconvolutionallayersbyskippingtheirevaluationinsomeofthespatialpositions.Weproposeandanalyzeseveralstrategiesofchoosingthesepositions.WedemonstratethatperforationcanacceleratemodernconvolutionalnetworkssuchasAlexNetandVGG-16byafactorof2\u00d7-4\u00d7.Additionally,weshowthatperforationiscomplementarytotherecentlyproposedaccelerationmethodofZhangetal.[28].1IntroductionThelastfewyearshaveseenconvolutionalneuralnetworks(CNNs)emergeasanindispensabletoolforcomput
ervision.However,modernCNNshaveahighcomputationalcostofevaluation,withconvolutionallayersusuallytakingupover80%ofthetime.Forinstance,VGG-16network[25]fortheproblemofobjectrecognitionrequires1.5\u00b71010\ufb02oatingpointmultiplicationsperimage.ThesecomputationalrequirementshinderthedeploymentofsuchnetworksonsystemswithoutGPUsandinscenarioswherepowerconsumptionisamajorconcern,suchasmobiledevices.Theproblemoftradingaccuracyofcomputationsforspeediswell-knownwithinthesoftwareengineeringcommunity.Oneofthemostprominentmethodsforthisproblemisloopperforation[18,19,24].Inanutshell,thistechniqueisolatesloopsinthecodethatarenotcriticalfortheexecution,andthenreducestheircomputationalcostbyskippingsomeiterations.Morerecently,researchershaveconsideredproblem-dependentperforationstrategiesthatexploitthestructureoftheproblem[23].Inspiredbythegeneralprincipleofperforation,weproposetoreducethecomputationalcostofCNNevaluationbyexploitingthespatialredundancyofthenetwork.ModernCNNs,suchasAlexNet,exploitthisredundancythroughtheuseofstridesintheconvolutionallayers.However,usingtheconvolutionalstrideschangesthearchitectureofthenetwork(intermediaterepresentationssizeandthenumberofweightsinthe\ufb01rstfully-connectedlayer),whichmightbeundesirable.Insteadofusingstrides,wearguefortheuseofinterpolation(perforation)ofresponsesintheconvolutionallayer.Akeyelementofthisapproachisthechoiceoftheperforationmask,whichde\ufb01nestheoutputpositionstoevaluateexactly.Weproposeseveralapproachestoselecttheperforationmasksandamethodofchoosingacombinationofperforationmasksfordifferentlayers.Torestorethenetworkaccuracy,weperform\ufb01ne-tuningoftheperforatednetwork.OurexperimentsshowthatthismethodcanreducetheevaluationtimeofmodernCNNarchitecturesproposedintheliteraturebyafactorof2\u00d7-4\u00d7withasmalldecreaseinaccuracy.2RelatedWorkReducingthecomputationalcostofCNNevaluationisanactiveareaofresearch,withbothhighlyoptimizedimplementationsandapproximatemethodsinvestigated.30thConferenceonNeuralInformationProcess
ingSystems(NIPS2016),Barcelona,Spain.\f\ud835\udc51\"\ud835\udc46\ud835\udc4b%\ud835\udc4c\u2032tensorUdata matrix M\ud835\udc4c\ud835\udc4b\ud835\udc51\ud835\udc51\ud835\udc46\ud835\udc46im2rowkernel KtensorV\u00d7=\ud835\udc47\ud835\udc4c\u2032X\u2032\ud835\udc47\ud835\udc51\"\ud835\udc46\ud835\udc4711Figure1:Reductionofconvolutionallayerevaluationtomatrixmultiplication.Ourideaistoleaveonlyasubsetofrows(de\ufb01nedbyaperforationmask)inthedatamatrixMandtointerpolatethemissingoutputvalues.ImplementationsthatexploittheparallelismavailableincomputationalarchitectureslikeGPUs(cuda-convnet2[13],CuDNN[3])haveallowedtosigni\ufb01cantlyreducetheevaluationtimeofCNNs.SinceCuDNNinternallyreducesthecomputationofconvolutionallayerstothematrix-by-matrixmultiplication(withoutexplicitlymaterializingthedatamatrix),ourapproachcanpotentiallybeincorporatedintothislibrary.Inasimilarvein,theuseofFPFGAs[22]leadstobettertrade-offsbetweenspeedandpowerconsumption.Severalpapers[5,9]showedthatCNNsmaybeef\ufb01cientlyevaluatedusinglowprecisionarithmetic,whichisimportantforFPFGAimplementations.MostapproximatemethodsofdecreasingtheCNNcomputationalcostexploittheredundanciesoftheconvolutionalkernelusinglow-ranktensordecompositions[6,10,16,28].Inmostcases,aconvolutionallayerisreplacedbyseveralconvolutionallayersappliedsequentially,whichhaveamuchlowertotalcomputationalcost.WeshowthatthecombinationofperforationwiththemethodofZhangetal.[28]improvesuponbothapproaches.Forspatiallysparseinputs,itispossibletoexploitthissparsitytospeedupevaluationandtraining[8].Whilethisapproachissimilartooursinthespirit,wedonotrelyonspatiallysparseinputs.Instead,wesparselysampletheoutputsofaconvolutionallayerandinterpolatetheremainingvalues.Inarecentwork,LebedevandLempitsky[15]alsodecreasetheCNNcomputationalcostbyreducingthesizeofthedatamatrix.Thedifferenceisthattheirapproachreducestheconvolutionalkernel\u2019ssupportwhileourapproachdecreasesthenumberofspatialpositionsinwhichtheconvolutionsareevaluated.Thetwomethodsarecom
plementary.Severalpapershavedemonstratedthatitispossibletocompresstheparametersofthefully-connectedlayers(wheremostCNNparametersreside)withamarginalerrorincrease[4,21,27].Sinceourmethoddoesnotdirectlymodifythefully-connectedlayers,itispossibletocombinethesemethodswithourapproachandobtainafastandsmallCNN.3PerforatedCNNsThesectionprovidesadetaileddescriptionofourapproach.Beforeproceedingfurther,weintroducethenotationthatwillbeusedintherestofthepaper.Notation.AconvolutionallayertakesasinputatensorUofsizeX\u00d7Y\u00d7SandoutputsatensorVofsizeX0\u00d7Y0\u00d7T,X0=X\u2212d+1,Y0=Y\u2212d+1.The\ufb01rsttwodimensionsarespatial(heightandwidth),andthethirddimensionisthenumberofchannels(forexample,foranRGBinputimageS=3).ThesetofTconvolutionkernelsKisgivenbyatensorofsized\u00d7d\u00d7S\u00d7T.Forsimplicityofnotation,weassumeunitstride,nozero-paddingandskipthebiases.Theconvolutionallayeroutputmaybede\ufb01nedasfollows:V(x,y,t)=dXi=1dXj=1SXs=1K(i,j,s,t)U(x+i\u22121,y+j\u22121,s)(1)Additionally,wede\ufb01nethesetofallspatialindices(positions)oftheoutput\u2126={1,...,X0}\u00d7{1,...,Y0}.PerforationmaskI\u2286\u2126isthesetofindicesinwhichtheoutputsarecalculatedexactly.DenoteN=|I|thenumberofpositionstobecalculatedexactly,andr=1\u2212N|\u2126|theperforationrate.Reductiontomatrixmultiplication.Toachievehighcomputationalperformance,manydeeplearn-ingframeworks,includingCaffe[12]andMatConvNet[26],reducethecomputationofconvolutional2\flayerstotheheavily-optimizedmatrix-by-matrixmultiplicationroutineofbasiclinearalgebrapack-ages.Thisprocess,sometimesreferredtoaslowering,isillustratedin\ufb01g.1.First,adatamatrixMofsizeX0Y0\u00d7d2Sisconstructedusingim2rowfunction.TherowsofMareelementsofpatchesofinputtensorUofsized\u00d7d\u00d7S.Then,MismultipliedbythekerneltensorKreshapedintosized2S\u00d7T.TheresultingmatrixofsizeX0Y0\u00d7TistheoutputtensorV,uptoareshape.Foramoredetailedexposition,see[26].3.1PerforatedconvolutionallayerInthissectionwepresenttheperforatedconvolutionallayer.Inasmallfraction
of spatial positions, the outputs of the proposed layer are equal to the outputs of a usual convolutional layer. The remaining values are interpolated using the nearest neighbor from this set of positions. We evaluate other interpolation strategies in appendix A. The perforated convolutional layer is a generalization of the standard convolutional layer. When the perforation mask is equal to all the output spatial positions, the perforated convolutional layer’s output equals the conventional convolutional layer’s output.

Formally, let I ⊆ Ω be the perforation mask of spatial output positions to be calculated exactly (the constraint that the masks are shared for all channels of the output is required for the reduction to matrix multiplication). The function ℓ(x, y): Ω → I returns the index of the nearest neighbor in I according to Euclidean distance (with ties broken randomly):

ℓ(x, y) = (ℓ1(x, y), ℓ2(x, y)) = argmin_{(x′, y′) ∈ I} √((x − x′)² + (y − y′)²).    (2)

Note that the function ℓ(x, y) may be calculated in advance and cached. The perforated convolutional layer output V̂ is defined as follows:

V̂(x, y, t) = V(ℓ1(x, y), ℓ2(x, y), t),    (3)

where V(x, y, t) is the output of the usual convolutional layer, defined by (1). Since ℓ(x, y) = (x, y) for (x, y) ∈ I, the outputs in the spatial positions I are calculated exactly. The values in the other positions are interpolated using the value of the nearest neighbor.

To evaluate a perforated convolutional layer, we only need to calculate the values V(x, y, t) for (x, y) ∈ I, which can be done efficiently by reduction to matrix multiplication. In this case, the data matrix M contains just N = |I| rows, instead of the original X′Y′ = |Ω| rows. Perforation is not limited to this implementation of a convolutional layer, and can be combined with other implementations that support strided convolutions, such as the direct convolution approach of cuda-convnet2 [13].

In our implementation, we only store the output values V(x, y, t) for (x, y) ∈ I. The interpolation is performed implicitly by masking the reads of the following pooling or convolutional layer. For example, when accelerating the conv3 layer of AlexNet, the interpolation cost is transferred to the conv4 layer. We observe no slowdown of the conv4 layer when using a GPU, and a 0-3% slowdown when using a CPU. This design choice has several advantages. Firstly, the memory size required to store the activations is reduced by a factor of 1/(1 − r). Secondly, the following non-linearity layers and 1×1 convolutional layers are also sped up, since they are applied to a smaller number of elements.

3.2 Perforation masks

We propose several ways of generating the perforation masks, i.e., choosing N points from Ω. We visualize the perforation masks I as binary matrices with black squares in the positions of the set I. We only consider perforation masks that are independent of the input object and leave the exploration of input-dependent perforation masks to future work.

Uniform perforation mask is just N points chosen randomly without replacement from the set Ω. However, as can be seen from fig. 2a, for N ≪ |Ω|, the points tend to cluster. This is undesirable because a more scattered set I would reduce the average distance to the set I.

Grid perforation mask is a set of points I = {a(1), ..., a(Kx)} × {b(1), ..., b(Ky)}, see fig. 2b. We choose the values of a(i), b(i) using the pseudorandom integer sequence generation scheme of [7].

Pooling structure mask exploits the structure of the overlaps of pooling operators. Denote by A(x, y) the number of times an output of the convolutional layer is used in the pooling operators. The grid-like pattern as in fig. 2d is caused by a pooling of size 3×3 with stride 2 (such parameters are used e.g. in Network in Network and AlexNet). The pooling structure mask is obtained by picking the top-N positions with the highest values of A(x, y), with ties broken randomly, see fig. 2c.

Figure 2: Perforation masks, AlexNet conv2, r = 80.25%: (a) uniform, (b) grid, (c) pooling structure, (d) weights A(x, y). Best viewed in color.

Figure 3: Top: ImageNet images and corresponding values of impact G(x, y; V) for AlexNet conv2. Bottom: average impacts and impact perforation mask for AlexNet conv2: (a) B(x, y), original network, (b) B(x, y), perforated network, (c) impact mask, r = 90%. Best viewed in color.

Impact mask estimates the impact of perforation of each position
on the CNN loss function, and then removes the least important positions. Denote by L(V) the loss function of the CNN (such as negative log-likelihood) as a function of the considered convolutional layer outputs V. Next, suppose V′ is obtained from V by replacing one element (x0, y0, t0) with a neutral value, zero. We estimate the impact of a position as a first-order Taylor approximation of the magnitude of change of L(V):

|L(V′) − L(V)| ≈ |Σ_{x=1}^{X′} Σ_{y=1}^{Y′} Σ_{t=1}^{T} ∂L(V)/∂V(x, y, t) · (V′(x, y, t) − V(x, y, t))| = |∂L(V)/∂V(x0, y0, t0) · V(x0, y0, t0)|.    (4)

The value ∂L(V)/∂V(x0, y0, t0) may be obtained using backpropagation. In the case of a perforated convolutional layer, we calculate the derivatives with respect to the convolutional layer output V (not the interpolated output V̂). This makes the impact of the previously perforated positions zero and sums the impact of the non-perforated positions over all the outputs which share the value. Since we are interested in the total impact of a spatial position (x, y) ∈ Ω, we take a sum over all the channels and average this estimate of impacts over the training dataset:

G(x, y; V) = Σ_{t=1}^{T} |∂L(V)/∂V(x, y, t) · V(x, y, t)|,    (5)

B(x, y) = E_{V ∼ training set} G(x, y; V).    (6)

Finally, the impact mask is formed by taking the top-N positions with the highest values of B(x, y). Examples of the values of G(x, y; V), B(x, y) and the impact mask are shown in fig. 3. Note that the regions of high value of G(x, y; V) usually contain the most salient features of the image. The averaged weights B(x, y) tend to be higher in the center, since ImageNet’s images usually contain a centered object. Additionally, a grid-like structure resembling the pooling structure mask is automatically inferred.

Table 1: Details of the CNNs used for the experimental evaluation. Timings, memory consumption and number of multiplications are normalized by the batch size. Memory consumption is the memory required to store the activations (intermediate results) of the network during the forward pass.

Network | Dataset  | Error        | CPU time | GPU time | Mem.   | Mult.    | # conv
NIN     | CIFAR-10 | top-1 10.4%  | 4.6 ms   | 0.8 ms   | 5.1 MB | 2.2·10^8 | 3
AlexNet | ImageNet | top-5 19.6%  | 16.7 ms  | 2.0 ms   | 6.6 MB | 0.5·10^9 | 5
VGG-16  | ImageNet | top-5 10.1%  | 300 ms   | 29 ms    | 110 MB | 1.5·10^10 | 13

Figure 4: Acceleration of a single layer of AlexNet for different mask types (uniform, grid, pooling structure, impact) without fine-tuning: (a) conv2, CPU; (b) conv2, GPU; (c) conv3, CPU; (d) conv3, GPU. Curves show top-5 error increase (%) against speedup. Values are averaged over 5 runs.

Since perforation of a layer changes the impacts of all the layers, in the experiments we iterate between increasing the perforation rate of a layer and recalculation of the impacts. We find that this improves the results by co-adapting the perforation masks of different convolutional layers.

3.3 Choosing the perforation configurations

For whole network acceleration, it is important to find a combination of per-layer perforation rates that would achieve a high speedup with a low error increase. To do this, we employ a simple greedy strategy. We use a single perforation mask type and a fixed range of increasing perforation rates. Denote by t the evaluation time of the accelerated network and by e the objective (we use negative log-likelihood for a subset of training images). Let t0 and e0 be the respective values for the non-accelerated network. At each iteration, we try to increase the perforation rate for each layer and choose the layer for which this results in the minimal value of the cost function (e − e0)/(t0 − t).

4 Experiments

We use three convolutional neural networks of increasing size and computational complexity: Network in Network [17], AlexNet [14] and VGG-16 [25], see table 1. In all networks, we attempt to perforate all the convolutional layers, except for the 1×1 convolutional layers of NIN. We perform timings on a computer with a quad-core Intel Core i5-4460 CPU, 16 GB RAM and an NVidia GeForce GTX 980 GPU. The batch size used for timings is 128 for NIN, 256 for AlexNet and 16 for VGG-16. The networks are obtained from the Caffe Model Zoo. For AlexNet, the Caffe reimplementation is used, which is slightly different from the original architecture (pooling and normalization layers are swapped). We use a fork of the MatConvNet framework for all experiments, except for fine-tuning of AlexNet and VGG-16, for which we use a fork of Caffe. The source code is available at https://github.com/mfigurnov/perforated-cnn-matconvnet, https://github.com/mfigurnov/perforated-cnn-caffe.

We begin our experiments by comparing the proposed perforation masks in a common benchmark setting: acceleration of a single AlexNet layer. Then, we compare whole-network acceleration with the best-performing masks to baselines such as a decrease of the input image size and an increase of strides. We proceed to show that perforation scales to large networks by presenting the whole-network acceleration results for AlexNet and VGG-16. Finally, we demonstrate that perforation is complementary to the recently proposed acceleration method of Zhang et al. [28].

Table 2: Acceleration of AlexNet’s conv2. Top: our results after fine-tuning; bottom: previously published results. Result of [10] provided by [16]. The experiment with reduced spatial size of the kernel (3×3, instead of 5×5) suggests that perforation is complementary to the “brain damage” method of [15], which also reduces the spatial support of the kernel.

Method                        | CPU time ↓ | Error ↑ (%)
Impact, r = 3/4, 3×3 filters  | 9.1×       | +1
Impact, r = 5/6               | 5.3×       | +1.4
Impact, r = 4/5               | 4.2×       | +0.9
Lebedev and Lempitsky [15]    | 20×        | top-1 +1.1
Lebedev and Lempitsky [15]    | 9×         | top-1 +0.3
Jaderberg et al. [10]         | 6.6×       | +1
Lebedev et al. [16]           | 4.5×       | +1
Denton et al. [6]             | 2.7×       | +1

4.1 Single layer results

We explore the speedup-error trade-off of the proposed perforation masks on the two bottleneck convolutional layers of AlexNet, conv2 and conv3, see fig. 4. The pooling structure perforation mask is only applicable to conv2, because it is directly followed by a max-pooling, whereas conv3 is followed by another convolutional layer. We see that the impact perforation mask works best for the conv2 layer, while the grid mask performs very well for conv3. The standard deviation of the results is small for all the perforation masks, except the uniform mask for high speedups (where the grid mask outperforms it). The results are similar for both CPU and GPU, showing the applicability of our method to both platforms. Note that if we consider the best perforation mask for
each speedup value, then we see that the conv2 layer is easier to accelerate than the conv3 layer. We observe this pattern in other experiments: layers immediately followed by a max-pooling are easier to accelerate than the layers followed by a convolutional layer. Additional results for the NIN network are presented in appendix B.

We compare our results after fine-tuning to the previously published results on the acceleration of AlexNet’s conv2 in table 2. Motivated by the results of [15] that the spatial support of the conv2 convolutional kernel may be reduced with a small error increase, we reduce the kernel’s spatial size from 5×5 to 3×3 and apply the impact perforation mask. This leads to a 9.1× acceleration for a 1% top-5 error increase. Using the more sophisticated method of [15] to reduce the spatial support may lead to further improvements.

4.2 Baselines

We compare PerforatedCNNs with baseline methods of decreasing the computational cost of CNNs by exploiting the spatial redundancy. Unlike perforation, these methods decrease the size of the activations (intermediate outputs) of the CNN. For a network with fully-connected (FC) layers, this would change the number of CNN parameters in the first FC layer, effectively modifying the architecture. To avoid this, we use the CIFAR-10 NIN network, which replaces FC layers with global average pooling (mean-pooling over all spatial positions in the last layer). We consider the following baseline methods.

Resize. The input image is downscaled with the aspect ratio preserved.

Stride. The strides of the convolutional layers are increased, making the activations spatially smaller.

Fractional stride. Motivated by fractional max-pooling [7], we introduce a more flexible modification of strides which evaluates convolutions on a non-regular grid (with a varying step size), providing a more fine-grained control over the activations size and speedup. We use the grid perforation mask generation scheme to choose the output positions to evaluate.

We compare these strategies to perforation of all the layers with the two types of masks which performed best in the previous section: grid and impact. Note that “grid” is, in fact, equivalent to fractional strides, but with the missing values being interpolated.

All the methods, except resize, require a parameter value per convolutional layer, leading to a large number of possible configurations. We use the original network to explore this space of configurations. For impact, we use the greedy algorithm. For stride, we evaluate all possible combinations of parameters. For grid and fractional strides, for each layer we consider the set of rates 1/3, 1/2, ..., 8/9, 9/10 (for fractional strides this is the fraction of convolutions calculated), and evaluate all combinations of such rates. Then, for each method, we build a Pareto-optimal front of parameters which produce the smallest error increase for a given CPU speedup. Finally, we train the network weights “from scratch” (starting from a random initialization) for the Pareto-optimal configurations with accelerations close to 2×, 3×, 4×. For fractional strides, we use fine-tuning, since it performs significantly better than training from scratch.

Figure 5: Comparison of whole network perforation (grid and impact masks) with baseline strategies (resizing the input images, increasing the strides of convolutional layers, fractional strides) for acceleration of the CIFAR-10 NIN network: (a) original network, (b) after retraining. Curves show top-1 error (%) against CPU speedup.

The results are displayed in fig. 5. Impact perforation is the best strategy both for the original network and after training the network from scratch. Grid perforation is slightly worse. Convolutional strides are used in many CNNs, such as AlexNet, to decrease the computational cost of training and evaluation. Our results show that if changing the intermediate representations size and training the network from scratch is an option, then it is indeed a good strategy. Although more general, fractional strides perform poorly compared to strides, most likely because they “downsample” the outputs of a convolutional layer non-uniformly, making them hard to process by the next convolutional layer.

4.3 Whole network results

We evaluate the effect of perforation of all the convolutional layers of three CNN models. To tune the perforation rates, we employ the greedy method described in section 3.3. We use twenty perforation rates: 1/3, 1/2, 2/3, ..., 18/19, 19/20. For NIN and AlexNet we use the impact perforation mask. For VGG-16 we use the grid perforation mask, as we find that it considerably simplifies fine-tuning. Using more than one type of perforation mask does not improve the results. Obtaining the perforation rates configuration takes about one day for the largest network we considered, VGG-16. In order to decrease the error of the accelerated network, we tune the network’s weights. We do not observe any problems with backpropagation, such as exploding/vanishing gradients.

The results are presented in table 3. Perforation damages the network performance significantly, but network weights tuning restores most of the accuracy. All the considered networks may be accelerated by a factor of two on both CPU and GPU, with under a 2.6% increase of error. Theoretical speedups (reduction of the number of multiplications) are usually close to the empirical ones. Additionally, the memory required to store the network activations is significantly reduced by storing only the non-perforated output values.

4.4 Combining acceleration methods

A promising way to achieve high speedups with a low error increase is to combine multiple acceleration methods. For this to succeed, the methods should exploit different types of redundancy in the network. In this section, we verify that perforation can be combined with the inter-channel redundancy elimination approach of [28] to achieve improved speedup-error ratios.

We reimplement the linear asymmetric method of [28]. It decomposes a convolutional layer with a (d × d × S × T) kernel (height - width - input channels - output channels) into a sequence of two layers, (d × d × S × T′) → (1 × 1 × T′ × T), T′ < T. The second layer is typically very fast, so the overall speedup is roughly T/T′. When decomposing a perforated convolutional layer, we transfer the perforation mask to the first obtained layer. We first apply perforation to the network and fine-tune it, as in the previous section. Then, we apply the inter-channel redundancy elimination method to this network. Finally, we perform a second round of fine-tuning with a much lower learning rate of 1e-9, due to exploding gradients. All the methods are tested at the theoretical speedup level of
4×. When the two methods are combined, the acceleration rate for each method is taken to be roughly equal. The results are presented in table 4. While the decomposition method outperforms perforation, the combined method is better than both of its components.

Table 3: Full network acceleration results. Arrows indicate increase or decrease of the metric. Speedup is the wall-clock acceleration. Mult. is the reduction of the number of multiplications in convolutional layers (theoretical speedup). Mem. is the reduction of memory required to store the network activations. Tuned error is the error after training from scratch (NIN) or fine-tuning (AlexNet, VGG-16) of the accelerated network’s weights.

Network | Device | Speedup | Mult. ↓ | Mem. ↓ | Error ↑ (%) | Tuned error ↑ (%)
NIN     | CPU    | 2.2×    | 2.5×    | 2.0×   | +1.5        | +0.4
NIN     | CPU    | 3.1×    | 4.4×    | 3.5×   | +5.5        | +1.9
NIN     | CPU    | 4.2×    | 6.6×    | 4.4×   | +8.3        | +2.9
NIN     | GPU    | 2.1×    | 3.6×    | 3.3×   | +4.5        | +1.6
NIN     | GPU    | 3.0×    | 10.1×   | 5.7×   | +18.2       | +5.6
NIN     | GPU    | 3.5×    | 19.1×   | 9.2×   | +37.4       | +12.4
AlexNet | CPU    | 2.0×    | 2.1×    | 1.8×   | +10.7       | +2.3
AlexNet | CPU    | 3.0×    | 3.5×    | 2.6×   | +28.0       | +6.1
AlexNet | CPU    | 3.6×    | 4.4×    | 2.9×   | +60.7       | +9.9
AlexNet | GPU    | 2.0×    | 2.0×    | 1.7×   | +8.5        | +2.0
AlexNet | GPU    | 3.0×    | 2.6×    | 2.0×   | +16.4       | +3.2
AlexNet | GPU    | 4.1×    | 3.4×    | 2.4×   | +28.1       | +6.2
VGG-16  | CPU    | 2.0×    | 1.8×    | 1.5×   | +15.6       | +1.1
VGG-16  | CPU    | 3.0×    | 2.9×    | 1.8×   | +54.3       | +3.7
VGG-16  | CPU    | 4.0×    | 4.0×    | 2.5×   | +71.6       | +5.5
VGG-16  | GPU    | 2.0×    | 1.9×    | 1.7×   | +23.1       | +2.5
VGG-16  | GPU    | 3.0×    | 2.8×    | 2.4×   | +65.0       | +6.8
VGG-16  | GPU    | 4.0×    | 4.7×    | 3.4×   | +76.5       | +7.3

Table 4: Acceleration of VGG-16, 4× theoretical speedup. The first row is the proposed method, the second row is our reimplementation of the linear asymmetric method of Zhang et al. [28], the third row is the combined method. Perforation is complementary to the acceleration method of Zhang et al.

Perforation | Asymm. [28] | Mult. ↓ | Mem. ↓ | Error ↑ (%) | Tuned error ↑ (%)
4.0×        | -           | 4.0×    | 2.5×   | +71.6       | +5.5
-           | 3.9×        | 3.9×    | 0.93×  | +6.7        | +2.0
1.8×        | 2.2×        | 4.0×    | 1.4×   | +2.9        | +1.6

5 Conclusion

We have presented PerforatedCNNs, which exploit the redundancy of intermediate representations of modern CNNs to reduce the evaluation time and memory consumption. Perforation requires only a minor modification of the convolutional layer and obtains speedups close to the theoretical ones on both CPU and GPU. Compared to the baselines, PerforatedCNNs achieve a lower error, are more flexible, and do not change the architecture of a CNN (the number of parameters in the fully-connected layers and the size of the intermediate representations). Retaining the architecture makes it easy to plug PerforatedCNNs into existing computer vision pipelines and only perform fine-tuning of the network, instead of a complete retraining. Additionally, perforation can be combined with acceleration methods which exploit other types of network redundancy to achieve further speedups. In future work, we plan to explore the connection between PerforatedCNNs and visual attention by considering input-dependent perforation masks that can focus on the salient parts of the input. Unlike recent works on visual attention [1, 11, 20], which consider rectangular crops of an image, PerforatedCNNs can process non-rectangular and even disjoint salient parts of the image by choosing appropriate perforation masks in the convolutional layers.

Acknowledgments. We would like to thank Alexander Kirillov and Dmitry Kropotov for helpful discussions, and Yandex for providing computational resources for this project. This work was supported by RFBR project No. 15-31-20596 (mol-a-ved) and by the Microsoft: Moscow State University Joint Research Center (RPD 1053945).

References

[1] J. Ba, R. Salakhutdinov, R. Grosse, and B. Frey, “Learning wake-sleep recurrent attention models,” NIPS, 2015.
[2] T. Chen, “Matrix shadow library,” https://github.com/dmlc/mshadow, 2015.
[3] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv, 2014.
[4] M. D. Collins and P. Kohli, “Memory bounded deep convolutional networks,” arXiv, 2014.
[5] M. Courbariaux, Y. Bengio, and J. David, “Low precision arithmetic for deep learning,” ICLR, 2015.
[6] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” NIPS, 2014.
[7] B. Graham, “Fractional max-pooling,” arXiv, 2014.
[8] B. Graham, “Spatially-sparse convolutional neural networks,” arXiv, 2014.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” ICML, 2015.
[10] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” BMVC, 2014.
[11] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” NIPS, 2015.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” ACM ICM, 2014.
[13] A. Krizhevsky, “cuda-convnet2,” https://github.com/akrizhevsky/cuda-convnet2/, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” NIPS, 2012.
[15] V. Lebedev and V. Lempitsky, “Fast convnets using group-wise brain damage,” CVPR, 2016.
[16] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” ICLR, 2015.
[17] M. Lin, Q. Chen, and S. Yan, “Network in network,” ICLR, 2014.
[18] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard, “Quality of service profiling,” ICSE, 2010.
[19] S. Misailovic, D. M. Roy, and M. C. Rinard, “Probabilistically accurate program transformations,” Static Analysis, 2011.
[20] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” NIPS, 2014.
[21] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov, “Tensorizing neural networks,” NIPS, 2015.
[22] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Research Whitepaper, 2015.
[23] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, “Paraprox: Pattern-based approximation for data parallel applications,” ASPLOS, 2014.
[24] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, “Managing performance vs. accuracy trade-offs with loop perforation,” ACM SIGSOFT, 2011.
[25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2015.
[26] A. Vedaldi and K. Lenc, “MatConvNet - convolutional neural networks for MATLAB,” arXiv, 2014.
[27] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. J. Smola, L. Song, and Z. Wang, “Deep fried convnets,” ICCV, 2015.
[28] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” arXiv, 2015.", "award": [], "sourceid": 574, "authors": [{"given_name": "Mikhail", "family_name": "Figurnov", "institution": "Skolkovo Inst. of Sc and Tech"}, {"given_name": "Aizhan", "family_name": "Ibraimova", "institution": "Skolkovo Institute of Science and Technology"}, {"given_name": "Dmitry", "family_name": "Vetrov", "institution": "Higher School of Economics, Yandex"}, {"given_name": "Pushmeet", "family_name": "Kohli", "institution": "Microsoft Research"}]}
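The perforated layer of section 3.1 reduces to three steps: build the reduced data matrix from only the masked positions, multiply it by the reshaped kernel, and copy each exact output into its nearest-neighbor positions. The NumPy sketch below illustrates those steps under the paper's assumptions (unit stride, no padding, no bias); it is not the authors' released MatConvNet/Caffe code, and the function and variable names are hypothetical.

```python
# Illustrative sketch of a perforated convolutional layer (sec. 3.1);
# not the authors' code. Assumes unit stride, no padding, no bias.
import numpy as np

def perforated_conv(U, K, mask):
    """U: (X, Y, S) input; K: (d, d, S, T) kernels;
    mask: list of (x, y) output positions evaluated exactly."""
    X, Y, S = U.shape
    d, _, _, T = K.shape
    Xo, Yo = X - d + 1, Y - d + 1
    # Reduced data matrix M: one im2row-style row per masked position.
    M = np.stack([U[x:x + d, y:y + d, :].ravel() for (x, y) in mask])
    V_exact = M @ K.reshape(d * d * S, T)       # (N, T): exact outputs
    # Nearest-neighbor map ell(x, y); in practice computed once and cached.
    pts = np.array(mask, dtype=float)
    V = np.empty((Xo, Yo, T))
    for x in range(Xo):
        for y in range(Yo):
            nn = np.argmin((pts[:, 0] - x) ** 2 + (pts[:, 1] - y) ** 2)
            V[x, y] = V_exact[nn]               # interpolate by copying
    return V
```

With the full mask (r = 0) this reproduces a direct evaluation of eq. (1); with a partial mask, each skipped position simply repeats its nearest exactly-computed neighbor, as in eq. (3).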
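The impact estimate of eq. (4) can be sanity-checked numerically on any differentiable loss: zeroing one output element (x0, y0, t0) changes the loss by approximately |∂L/∂V(x0, y0, t0) · V(x0, y0, t0)|. The check below uses a hypothetical softplus-sum loss as a stand-in for the CNN's negative log-likelihood (an assumption for illustration, not the paper's setup).

```python
# Toy numeric check of the first-order impact estimate, eq. (4).
# The softplus-sum loss is a hypothetical stand-in for the CNN objective.
import numpy as np

def loss(V):
    return np.sum(np.logaddexp(0.0, V))       # sum of softplus(V(x, y, t))

def impact_estimate(V, x0, y0, t0):
    grad = 1.0 / (1.0 + np.exp(-V))           # dL/dV for the softplus loss
    return abs(grad[x0, y0, t0] * V[x0, y0, t0])

rng = np.random.default_rng(1)
V = 0.01 * rng.standard_normal((3, 3, 2))     # small outputs: Taylor regime
V0 = V.copy()
V0[1, 1, 0] = 0.0                             # "perforate" one position
exact = abs(loss(V0) - loss(V))
approx = impact_estimate(V, 1, 1, 0)
```

Summing this per-element quantity over channels and averaging over training images gives G(x, y; V) and B(x, y) of eqs. (5)-(6); the impact mask keeps the top-N positions of B.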
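The greedy configuration search of section 3.3 can be sketched the same way: at each step, raise the perforation rate of the layer whose increase minimizes the error-per-time-saved ratio (e - e0)/(t0 - t). In this sketch (not the authors' code), `evaluate` is a hypothetical callback returning the evaluation time and objective for a vector of per-layer rates; the toy cost model in the usage check is likewise an assumption.

```python
# Illustrative sketch of the greedy per-layer rate selection (sec. 3.3);
# not the authors' code. `evaluate(rates) -> (time, objective)`.
def greedy_perforation(num_layers, rates, evaluate, target_speedup):
    idx = [0] * num_layers                    # rate index per layer; rates[0] = 0
    t0, e0 = evaluate([rates[0]] * num_layers)
    t = t0
    while t0 / t < target_speedup:
        best, best_cost, t_best = None, None, t
        for layer in range(num_layers):
            if idx[layer] + 1 == len(rates):
                continue                      # layer already at maximal rate
            trial = [rates[j] for j in idx]
            trial[layer] = rates[idx[layer] + 1]
            ti, ei = evaluate(trial)
            if ti >= t0:
                continue                      # no time saved, skip
            cost = (ei - e0) / (t0 - ti)      # error increase per time saved
            if best_cost is None or cost < best_cost:
                best, best_cost, t_best = layer, cost, ti
        if best is None:
            break                             # no further acceleration possible
        idx[best] += 1
        t = t_best
    return [rates[j] for j in idx]
```

With a toy two-layer cost model in which perforating a layer shrinks its time proportionally to (1 - r) and grows the objective quadratically in r, the search raises the rate of the more expensive layer first and stops once the target speedup is reached.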