{"title": "Large Margin Learning of Upstream Scene Understanding Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2586, "page_last": 2594, "abstract": "Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.", "full_text": 
"Large Margin Learning of Upstream Scene Understanding Models\n\nJun Zhu\u2020, Li-Jia Li\u2021, Fei-Fei Li\u2021, Eric P. Xing\u2020\n\u2020{junzhu,epxing}@cs.cmu.edu, \u2021{lijiali,feifeili}@cs.stanford.edu\n\u2020School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213\n\u2021Department of Computer Science, Stanford University, Stanford, CA 94305\n\nAbstract\n\nUpstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.\n\n1 Introduction\n\nProbabilistic topic models like the latent Dirichlet allocation (LDA) [5] have recently been applied to a number of computer vision tasks, such as object annotation and scene classification, due to their ability to capture latent semantic compositions of natural images [22, 23, 9, 13]. One of the advocated advantages of such models is that they do not require \u201csupervision\u201d during training, which is arguably preferred over supervised learning that would necessitate extra cost. But with the increasing availability of free on-line information such as image tags, user ratings, etc., various forms of \u201cside-information\u201d that can potentially offer \u201cfree\u201d supervision have led to a need for new models and training schemes that can make effective use of such information to achieve better results, such as more discriminative topic representations of image contents and more accurate image classifiers.\n\nThe standard unsupervised LDA ignores the commonly available supervision information, and thus can discover a sub-optimal topic representation for prediction tasks. Extensions to supervised topic models, which can exploit side information for discovering predictive topic representations, have been proposed, such as the sLDA [4, 25] and MedLDA [27]. A common characteristic of these models is that they are downstream, that is, the supervised response variables are generated from topic assignment variables. Another type of supervised topic models are the so-called upstream models, in which the response variables directly or indirectly generate latent topic variables. In contrast to downstream supervised topic models (dSTM), which are mainly designed by machine learning researchers, upstream supervised topic models (uSTM) are well-motivated from human vision and psychology research [18, 10] and have been widely used for scene understanding tasks. For example, in the recently developed scene understanding models [23, 13, 14, 8], complex scene images are modeled as a hierarchy of semantic concepts where the topmost level corresponds to a scene, which can be represented as a set of latent objects likely to be found in a given scene. To learn an upstream scene model, maximum likelihood estimation (MLE) is the most common choice. However, MLE can make the prediction model estimation independent of latent topic discovery and result in an imbalanced prediction rule for scene classification, as we explain in Section 3.\n\nIn this paper, our goal is to address the weakness of MLE for learning upstream supervised topic models. Our approach is based on the max-margin principle for supervised learning, which has shown great promise in many machine learning tasks, such as classification [21] and structured output prediction [24]. For the dSTM, max-margin training has been developed in MedLDA [27], which has achieved better prediction performance than MLE. In such downstream models, latent topic assignments are sufficient statistics for the prediction model and it is easy to define the max-margin constraints based on existing max-margin methods (e.g., SVM). However, for upstream supervised topic models, the discriminant function for prediction involves an intractable computation of posterior distributions, which makes the max-margin training more delicate.\n\nSpecifically, we present a joint max-margin and max-likelihood estimation method for learning upstream scene understanding models. By using a variational approximation to the posterior distribution of supervised variables (e.g., scene categories), our max-margin learning approach iterates between posterior probabilistic inference and max-margin parameter learning. The parameter learning solves an online loss-augmented SVM, which closely couples the prediction model estimation and latent topic discovery, and this close interplay results in a well-balanced prediction rule for scene categorization. Finally, we demonstrate the advantages of our max-margin approach on both the 8-category sports [13] and the 67-class MIT indoor scene [20] datasets. Empirical results show that max-margin learning can significantly improve the scene classification accuracy.\n\nThe paper is structured as follows. Sec. 2 presents a generic scene understanding model we will work on. Sec. 3 discusses the weakness of MLE in learning upstream models. Sec. 4 presents the max-margin learning approach. Sec. 5 presents empirical results and Sec. 6 concludes.\n\n2 Joint Scene and Object Model: a Generic Running Example\n\nIn this section, we present a generic joint scene categorization and object annotation model, which will be used to demonstrate the large margin learning of upstream scene understanding models.\n\n2.1 Image Representation\n\nHow should we represent a scene image? Friedman [10] pointed out that object recognition is critical in the recognition of a scene. While individual objects contribute to the recognition of visual scenes, human vision researchers Navon [18] and Biederman [2] also showed that people perform rapid global scene analysis before conducting more detailed local object analysis when recognizing scene images. To obtain a generic model, we represent a scene by using its global scene features and the objects within it. We first segment an image $I$ into a set of local regions $\\{r_1, \\cdots, r_N\\}$. Each region is represented by three region features $R$ (i.e., color, location and texture) and a set of image patches $X$. These region features are represented as visual codewords. To describe detailed local information of objects, we partition each region into patches. For each patch, we extract the SIFT [16] features, which are insensitive to view-point and illumination changes. To model the global scene
representation, we extract a set of global features $G$ [19]. In our dataset, we represent an image as a tuple $(r, x, g)$, where $r$ denotes an instance of $R$, and likewise for $x$ and $g$.\n\n2.2 The Joint Scene and Object Model\n\nThe model is shown in Fig. 1(a). $S$ is the scene random variable, taking values from a finite set $\\mathcal{S} = \\{s_1, \\cdots, s_{M_s}\\}$. For an image, the distribution over scene categories depends on its global representation features $G$. Each scene is represented as a mixture over latent objects $O$ and the mixing weights are defined with a generalized linear model (GLM) parameterized by $\\psi$. By using a normal prior on $\\psi$, the scene model can capture the mutual correlations between different objects, similar to the correlated topic models (CTMs) [3]. Here, we assume that for different scenes, the objects have different distributions and correlations. Let $f$ denote the vector of real-valued feature functions of $S$ and $G$. The generating procedure of an image is as follows:\n\n1. Sample a scene category from a conditional scene model: $p(s \\mid g, \\theta) = \\frac{\\exp(\\theta^\\top f(g, s))}{\\sum_{s'} \\exp(\\theta^\\top f(g, s'))}$.\n2. Sample the parameters $\\psi \\mid s, \\mu, \\Sigma \\sim \\mathcal{N}(\\mu_s, \\Sigma_s)$.\n3. For each region $n$:\n(a) sample an object from $p(o_n = k \\mid \\psi) = \\frac{\\exp(\\psi_k)}{\\sum_j \\exp(\\psi_j)}$;\n(b) sample $M_r$ (i.e., 3: color, location and texture) region features: $r_{nm} \\mid o_n, \\beta \\sim \\mathrm{Multi}(\\beta_{m o_n})$;\n(c) sample $M_x$ image patches: $x_{nm} \\mid o_n, \\eta \\sim \\mathrm{Multi}(\\eta_{o_n})$.\n\nFigure 1: (a) a joint scene categorization and object annotation model with global features $G$; (b) average log-likelihood ratio $\\log p(s \\mid g, \\theta) / L_{-\\theta}$ under MLE and max-margin estimations, where the first bar is for true categories and the rest are for categories sorted based on their difference from the first one; (c) scene classification accuracy by using (Blue) $L_{-\\theta}$, (Green) $\\log p(s \\mid g, \\theta)$, and (Red) $L_{-\\theta} + \\log p(s \\mid g, \\theta)$ for prediction. Group 1 is for MLE and group 2 is for max-margin training.\n\nThe generative model defines a joint distribution\n\n$$p(s, \\psi, o, r, x \\mid g, \\Theta) = p(s \\mid \\theta, g)\\, p(\\psi \\mid \\mu_s, \\Sigma_s) \\prod_{n=1}^{N} \\Big( p(o_n \\mid \\psi) \\prod_{m=1}^{M_r} p(r_{nm} \\mid o_n, \\beta) \\prod_{m=1}^{M_x} p(x_{nm} \\mid o_n, \\eta) \\Big),$$\n\nwhere we have used $\\Theta$ to denote all the unknown parameters $(\\theta, \\mu, \\Sigma, \\beta, \\eta)$. From the joint distribution, we can make two types of predictions, namely scene classification and object annotation. For scene classification, we infer the maximum a posteriori prediction\n\n$$\\hat{s} \\triangleq \\arg\\max_s p(s \\mid g, r, x) = \\arg\\max_s \\log p(s, r, x \\mid g). \\quad (1)$$\n\nFor object annotation, we can use the inferred latent representation of regions based on $p(o \\mid g, r, x)$ and build a classifier to categorize regions into object classes, when some training examples with manually annotated objects are provided. Since collecting fully labeled images with annotated objects is difficult, upstream scene models are usually learned with partially labeled images for scene categorization, where only scene categories are provided and objects are treated as latent topics or themes [9]. In this paper, we focus on scene classification. Some empirical results on object annotation will be reported when labeled objects are available.\n\nWe use this joint model as a running example to demonstrate the basic principle of performing max-margin learning for the widely applied upstream scene understanding models because it is well-motivated, very generic and covers many other existing scene understanding models. For example, if we do not incorporate the global scene representation $G$, the joint model reduces to a model similar to those in [14, 6, 23]. Moreover, the generic joint model provides a good framework for studying the relative contributions of local object modeling and global scene representation, which has been shown to be useful for scene classification [20] and object detection [17] tasks.\n\n3 Weak Coupling of MLE in Learning Upstream Scene Models\n\nTo learn an upstream scene model, the most commonly used method is maximum likelihood estimation (MLE), such as in [23, 6, 14]. In this section, we discuss the weakness of MLE for learning upstream scene models and motivate the max-margin approach.\n\nLet $\\mathcal{D} = \\{(I_d, s_d)\\}_{d=1}^{D}$ denote a set of partially labeled training images. The standard MLE obtains the optimum model parameters by maximizing the log-likelihood (footnote 1) $\\sum_{d=1}^{D} \\log p(s_d, r_d, x_d \\mid g_d, \\Theta)$. By using the factorization of $p(s, \\psi, o, r, x \\mid g, \\Theta)$, MLE solves the following equivalent problem\n\n$$\\max_{\\theta, \\Theta_{-\\theta}} \\sum_d \\big( \\log p(s_d \\mid g_d, \\theta) + L_{s_d, -\\theta} \\big), \\quad (2)$$\n\nwhere $L_{s_d, -\\theta} \\triangleq \\log \\int_\\psi \\sum_o p(\\psi, o, r_d, x_d \\mid s_d, \\Theta) = \\log p(r_d, x_d \\mid s_d, \\Theta)$ is the log-likelihood of image features given the scene class, and $\\Theta_{-\\theta}$ denotes all the parameters except $\\theta$. Since $L_{s, -\\theta}$ does not depend on $\\theta$, the MLE estimation of the conditional scene model is to solve\n\n$$\\max_\\theta \\sum_d \\log p(s_d \\mid g_d, \\theta), \\quad (3)$$\n\nwhich does not depend on the latent object model. This is inconsistent with the prediction rule (1), which does depend on both the conditional scene model (i.e., $p(s \\mid g, \\theta)$) and the local object model. This decoupling will result in an imbalanced combination between the conditional scene and object models for prediction, as we explain below.\n\nFootnote 1: The conditional likelihood estimation can avoid this problem to some extent, but it has not been studied, to the best of our knowledge.\n\nWe first present some details of the MLE method. For $\\theta$, the problem (3) is an MLE estimation of a GLM, and it can be efficiently solved with gradient-based methods, such as quasi-Newton methods [15]. For $\\Theta_{-\\theta}$, since the likelihood $L_{s, -\\theta}$ is intractable to compute, we apply variational methods to obtain an approximation. By introducing a variational distribution $q_s(\\psi, o)$ to approximate the posterior $p(\\psi, o \\mid s, r, x, \\Theta)$ and using Jensen\u2019s inequality, we can derive a lower bound\n\n$$L_{s, -\\theta} \\geq E_{q_s}[\\log p(\\psi, o, r, x \\mid s, \\Theta)] + H(q_s) \\triangleq L_{-\\theta}(q_s, \\Theta), \\quad (4)$$\n\nwhere $H(q) = -E_q[\\log q]$ is the entropy. Then, the intractable prediction rule (1) can be approximated with the variational prediction rule\n\n$$\\hat{s} \\triangleq \\arg\\max_{s, q_s} \\big( \\log p(s \\mid g, \\theta) + L_{-\\theta}(q_s, \\Theta) \\big). \\quad (5)$$\n\nMaximizing $\\sum_d L_{-\\theta}(q_{s_d}, \\Theta)$ will lead to a closed form solution of $\\Theta_{-\\theta}$. See the Appendix for the inference of $q_s$ as involved in the prediction rule (5) and the estimation of $\\Theta_{-\\theta}$.\n\nNow, we examine the effects of the conditional scene model $p(s \\mid g, \\theta)$ in making a prediction via the prediction rule (5). Fig. 1(b-left) shows the relative importance of $\\log p(s \\mid g, \\theta)$ in the joint decision rule (5) on the sports dataset [13]. We can see that in MLE the conditional scene model plays a very weak role in making a prediction when it is combined with the object model, i.e., $L_{-\\theta}$. Therefore, as shown in Fig. 1(c), although a simple logistic regression with global features (i.e., the green bar) can achieve a good accuracy, the accuracy of the prediction rule (5) that uses the joint likelihood bound (i.e., the red bar) is decreased due to the strong effect of the potentially bad prediction rule based on $L_{-\\theta}$ (i.e., the blue bar), which only considers local image features.\n\nIn contrast, as shown in Fig. 1(b-right), in the max-margin approach to be presented, the conditional scene model plays a much more influential role in making a prediction via the rule (5). This results in a better balanced combination between the scene and the object models. The strong coupling is due to solving an online loss-augmented SVM, as we explain below. Note that we are not claiming any weakness of MLE in general. All our discussions are concentrated on learning upstream supervised topic models, as generically represented by the model in Fig. 1.\n\n4 Max-Margin Training\n\nNow, we present the max-margin method for learning upstream scene understanding models.\n\n4.1 Problem Definition\n\nFor the predictive rule (1), we use $F(s, g, r, x; \\Theta) \\triangleq \\log p(s \\mid g, r, x, \\Theta)$ to denote the discriminant function, which is more complicated than the commonly chosen linear form, in the sense we will explain shortly. In the same spirit as max-margin classifiers (e.g., SVMs), we define the hinge loss of the prediction rule (1) on $\\mathcal{D}$ as\n\n$$R_{hinge}(\\Theta) = \\frac{1}{D} \\sum_d \\max_s [\\Delta\\ell_d(s) - \\Delta F_d(s; \\Theta)],$$\n\nwhere $\\Delta\\ell_d(s)$ is a loss function (e.g., 0/1 loss), and $\\Delta F_d(s; \\Theta) = F(s_d, g_d, r_d, x_d; \\Theta) - F(s, g_d, r_d, x_d; \\Theta)$ is the margin favored by the true category $s_d$ over any other category $s$.\n\nThe problem with the above definition is that exactly computing the posterior distribution $p(s \\mid g, r, x, \\Theta)$ is intractable. As in MLE, we use a variational distribution $q_s$ to approximate it. By using Bayes\u2019 rule and the variational bound in Eq. (4), we can lower bound the log-likelihood\n\n$$\\log p(s \\mid g, r, x, \\Theta) = \\log p(s, r, x \\mid g, \\Theta) - \\log p(r, x \\mid g, \\Theta) \\geq \\log p(s \\mid g, \\theta) + L_{-\\theta}(q_s, \\Theta) - c, \\quad (6)$$\n\nwhere $c = \\log p(r, x \\mid g, \\Theta)$. Where it causes no ambiguity, we will write $L_{-\\theta}(q_s)$ without $\\Theta$. Since we need to make some assumptions about $q_s$, the equality in (6) usually does not hold. Therefore, the tightest lower bound is an approximation of the intractable discriminant function\n\n$$F(s, g, r, x; \\Theta) \\approx \\log p(s \\mid g, \\theta) + \\max_{q_s} L_{-\\theta}(q_s) - c. \\quad (7)$$\n\nThen, the margin is $\\Delta F_d(s; \\Theta) = \\theta^\\top \\Delta f_d(s) + \\max_{q_{s_d}} L_{-\\theta}(q_{s_d}) - \\max_{q_s} L_{-\\theta}(q_s)$, of which the linear term is the same as that in a linear SVM [7] and the difference between two variational bounds causes the topic discovery to bias the learning of the scene classification model, as we shall see.\n\nUsing the variational discriminant function in Eq. (7) and applying the principle of regularized empirical risk minimization, we define the max-margin learning of the joint scene and object model as solving\n\n$$\\min_\\Theta \\; \\Omega(\\Theta) + \\lambda \\sum_d \\big( -\\max_{q_{s_d}} L_{-\\theta}(q_{s_d}) \\big) + C R_{hinge}(\\Theta), \\quad (8)$$\n\nwhere $\\Omega(\\Theta)$ is a regularizer of the parameters. Here, we define $\\Omega(\\Theta) \\triangleq \\frac{1}{2}\\|\\theta\\|_2^2$. For the normal mean $\\mu_s$ or covariance matrix $\\Sigma_s$, a similar $\\ell_2$-norm or Frobenius norm can be used without changing our algorithm. The free parameters $\\lambda$ and $C$ are positive and trade off the classification loss and the data likelihood. When $\\lambda \\to \\infty$, the problem (8) reduces to the standard MLE of the joint scene model with a fixed uniform prior on scene classes. Moreover, we can see the difference from the standard MLE (2). Here, we minimize a hinge loss, which is defined on the joint prediction rule, while MLE minimizes the log-likelihood loss $\\log p(s_d \\mid g_d, \\theta)$, which does not depend on the latent object model. Therefore, our approach can be expected to achieve a closer dependence between the conditional scene model and the latent object model. More insights will be provided in the next section.\n\n4.2 Solving the Optimization Problem\n\nThe problem (8) is generally hard to solve because the model parameters and variational distributions are strongly coupled. Therefore, we develop a natural iterative procedure that alternates between estimating the parameters $\\Theta$ and performing posterior inference. The intuition is that by fixing one part (e.g., $q_s$) the other part (e.g., $\\Theta$) can be efficiently solved. Specifically, using the definitions, we rewrite the problem (8) as a min-max optimization problem\n\n$$\\min_{\\Theta, \\{q_{s_d}\\}} \\max_{\\{s, q_s\\}} \\Big( \\frac{1}{2}\\|\\Theta\\|_2^2 - (\\lambda + C) \\sum_d L_{-\\theta}(q_{s_d}) + C \\sum_d [-\\theta^\\top \\Delta f_d(s) + \\Delta\\ell_d(s) + L_{-\\theta}(q_s)] \\Big), \\quad (9)$$\n\nwhere the factor $1/D$ in $R_{hinge}$ is absorbed in the constant $C$. This min-max problem can be approximately solved with an iterative procedure. First, we infer the optimal variational posterior (footnote 2) $q_s^\\star = \\arg\\max_{q_s} L_{-\\theta}(q_s)$ for each $s$ and each training image. Then, we solve\n\n$$\\min_{\\Theta, \\{q_{s_d}\\}} \\Big( \\frac{1}{2}\\|\\Theta\\|_2^2 - (\\lambda + C) \\sum_d L_{-\\theta}(q_{s_d}) + C \\sum_d \\max_s [-\\theta^\\top \\Delta f_d(s) + \\Delta\\ell_d(s) + L_{-\\theta}(q_s^\\star)] \\Big).$$\n\nFor this sub-step, again, we apply an alternating procedure to solve the minimization problem over $\\Theta$ and $q_{s_d}$. We first infer the optimal variational posterior $q_{s_d}^\\star = \\arg\\max_{q_{s_d}} L_{-\\theta}(q_{s_d})$, and then we estimate the parameters by solving the following problem\n\n$$\\min_\\Theta \\Big( \\frac{1}{2}\\|\\Theta\\|_2^2 - (\\lambda + C) \\sum_d L_{-\\theta}(q_{s_d}^\\star) + C \\sum_d \\max_s [-\\theta^\\top \\Delta f_d(s) + \\Delta\\ell_d(s) + L_{-\\theta}(q_s^\\star)] \\Big). \\quad (10)$$\n\nFootnote 2: To retain an accurate large-margin criterion for estimating model parameters (especially $\\theta$), we do not perform the maximization over $s$ at this step.\n\nSince inferring $q_{s_d}^\\star$ is included in the step of inferring $q_s^\\star$ ($\\forall s$), the algorithm can be summarized as a two-step EM procedure that iteratively performs posterior inference of $q_s$ and max-margin parameter estimation. Another way to understand this iterative procedure is from the definitions. The first step of inferring $q_s^\\star$ is to compute the discriminant function $F$ under the current model. Then, we update the model parameters $\\Theta$ by solving a large-margin learning problem. For brevity, we present the parameter estimation only. The posterior inference is detailed in Appendix A.1.\n\nParameter Estimation: This step can be done with an alternating minimization procedure. For the Gaussian parameters $(\\mu, \\Sigma)$ and multinomial parameters $(\\eta, \\beta)$, the estimation can be written in a closed form as in a standard MLE of CTMs [3] by using a loss-augmented prediction of $s$. For brevity, we defer the details to Appendix A.2. Now, we present the step of estimating $\\theta$, which illustrates the essential difference between the large-margin approach and the standard MLE. Specifically, the optimum solution of $\\theta$ is obtained by solving the sub-problem (footnote 3)\n\n$$\\min_\\theta \\; \\frac{1}{2}\\|\\theta\\|_2^2 + C \\sum_d \\Big( \\max_s [\\theta^\\top f(g_d, s) + \\Delta\\ell_d(s) + L_{-\\theta}(q_s^\\star)] - [\\theta^\\top f(g_d, s_d) + L_{-\\theta}(q_{s_d}^\\star)] \\Big),$$\n\nwhich is equivalent to a constrained problem obtained by introducing a set of non-negative slack variables $\\xi$:\n\n$$\\min_{\\theta, \\xi} \\; \\frac{1}{2}\\|\\theta\\|_2^2 + C \\sum_{d=1}^{D} \\xi_d \\quad \\text{s.t.:} \\quad \\theta^\\top \\Delta f_d(s) + [L_{-\\theta}(q_{s_d}^\\star) - L_{-\\theta}(q_s^\\star)] \\geq \\Delta\\ell_d(s) - \\xi_d, \\; \\forall d, s. \\quad (11)$$\n\nFootnote 3: The constant (w.r.t. $\\theta$) term $-C \\sum_d L_{-\\theta}(q_{s_d}^\\star)$ is kept for easy explanation. It won\u2019t change the estimation.\n\nThe constrained optimization problem is similar to that of a linear SVM [7]. However, the difference is that we have the additional term $\\Delta L_d^\\star(s) \\triangleq L_{-\\theta}(q_{s_d}^\\star) - L_{-\\theta}(q_s^\\star)$. This term indicates that the estimation of the scene classification model is influenced by the topic discovery procedure, which finds an optimum posterior distribution $q^\\star$. If $\\Delta L_d^\\star(s) < 0$ for some $s \\neq s_d$, which means it is very likely that a wrong scene $s$ explains the image content better than the true scene $s_d$, then the term $\\Delta L_d^\\star(s)$ acts to augment the linear decision boundary $\\theta$ so that the prediction rule (5) makes a correct prediction on this image. If $\\Delta L_d^\\star(s) > 0$, which means the true scene can explain the image content better than $s$, then the linear decision boundary can be slightly relaxed. If we move the additional term to the right hand side, the problem (11) is to learn a linear SVM, but with an online updated loss function $\\Delta\\ell_d(s) - \\Delta L_d^\\star(s)$. We call this SVM an online loss-augmented SVM. Solving the loss-augmented SVM will result in an amplified influence of the scene classification model in the joint predictive rule (5), as shown in Fig. 1(b).\n\n5 Experiments\n\nNow, we present an empirical evaluation of our approach on the sports [13] and MIT indoor scene [20] datasets. Our goal is to demonstrate the advantages of the max-margin method over the MLE for learning upstream scene models with or without global features. Although the model in Fig. 1 can also be used for object annotation, we report the performance on scene categorization only, which is our main focus in this paper. For object annotation, which requires additional human annotated examples of objects, some preliminary results are reported in the Appendix due to space limitation.\n\n5.1 Datasets and Features\n\nThe sports data contain 1574 diverse scene images from 8 categories, as listed in Fig. 2 with example images. The indoor scene dataset [20] contains 15620 scene images from 67 categories, as listed in Table 2. We use the method of [1] to segment these images into small regions based on color, brightness and texture homogeneity. For each region, we extract color, texture and location features, and quantize them into 30, 50 and 120 codewords, respectively. Similarly, the SIFT features extracted from the small patches within each region are quantized into 300 SIFT codewords. We use the gist features [19] as one example of global features. Extension to include other global features, such as SIFT sparse codes [26], can be directly done without changing the model or the algorithm.\n\n5.2 Models\n\nFor the upstream scene model as in Fig. 1, we compare the max-margin learning with the MLE method, and we denote the scene models trained with max-margin training and MLE by MM-Scene and MLE-Scene, respectively. For both methods, we evaluate the effectiveness of global features, and we denote the scene models without global features by MM-Scene-NG and MLE-Scene-NG, respectively. Since our main goal in this paper is to demonstrate the advantages of max-margin learning in upstream supervised topic models, rather than dominance of such models over all others, we just compare with one example of downstream models: the multi-class sLDA (Multi-sLDA) [25]. A systematic comparison with other methods, including DiscLDA [12] and MedLDA [27], is deferred to a full version. For the downstream Multi-sLDA, the image-wise scene category variable $S$ is generated from latent object variables $O$ via a softmax function. For this downstream model, the parameter estimation can be done with MLE as detailed in [25].\n\nFinally, to show the usefulness of the object model in scene categorization, we also compare with the margin-based multi-class SVM [7] and likelihood-based logistic regression for scene classification based on the global features. For the SVM, we use the software SVMmulticlass (footnote 4), which implements a fast cutting-plane algorithm [11] to do parameter learning. We use the same software with slight changes to learn the loss-augmented SVM in our max-margin method.\n\nFootnote 4: http://svmlight.joachims.org/svmmulticlass.html\n\n5.3 Scene Categorization on the 8-Class Sports Dataset\n\nWe partition the dataset equally into training and testing data. For all the models except SVM and logistic regression, we run 5 times with random initialization of the topic parameters (e.g., $\\beta$ and $\\eta$).\n\nFigure 2: Example images from each category in the sports dataset with predicted scene classes, where the predictions in blue are correct while the red ones are wrong predictions.\n\nFigure 3: Classification accuracy of different models with respect to the number of topics. (Plot: x-axis, number of topics from 10 to 100; y-axis, scene classification accuracy; curves for MM-Scene, MM-Scene-NG, MLE-Scene, MLE-Scene-NG, Multi-sLDA and Multi-SVM.)\n\nThe average overall accuracy of scene categorization on 8 categories and its standard deviation are shown in Fig. 3. The result of logistic regression is shown in the left green bar in Fig. 1(c). We also show the confusion matrix of the max-margin scene model with 100 latent topics in Table 1, and example images from each category are shown in Fig. 2 with predicted labels. Overall, the max-margin scene model with global features achieves significant improvements as compared to all other approaches we have tested. Interestingly, although we provide only scene categories as supervised information during training, our best performance with global features is close to that reported in [13], where additional supervision of objects is used. The outstanding performance of the max-margin method for scene classification can be understood from the following aspects.\n\nMax-margin training: from the comparison of the max-margin approach with the standard MLE, both with and without global features, we can see that max-margin learning can improve the performance dramatically, especially when the scene model uses global features (about 3 percent). This is due to the well-balanced prediction rule achieved by the max-margin method, as we have explained in Section 3.\n\nGlobal features: from the comparison between the scene models with and without global features, we can see that using the gist features can significantly (about 8 percent) improve the scene categorization accuracy in both MLE and max-margin training. We also did some preliminary experiments on the SIFT sparse codes feature [26], which is a bit more expensive to extract. By using both gist and sparse codes features, we can achieve dramatic improvements in both max-margin and MLE methods. Specifically, the max-margin scene model achieves an accuracy of about 0.83 in scene classification, and the likelihood-based model obtains an accuracy of about 0.80.\n\nObject modeling: the superior performance of the max-margin learned MM-Scene model compared to the SVM and logistic regression (see the left green bar of Fig. 1(c)), which use global features only, indicates that modeling objects can facilitate scene categorization. This is because the scene classification model is influenced by the latent object modeling through the term $\\Delta L_d^\\star(s)$, which can improve the decision boundary of a standard linear SVM for those images that have negative scores of $\\Delta L_d^\\star(s)$, as we have discussed in the online loss-augmented SVM. However, object modeling does not improve the classification accuracy, and sometimes it can even be harmful, when the scene model is learned with the standard MLE. This is because the object model (using the state-of-the-art representation) (e.g., MM-MLE-NG) alone performs much worse than global feature models (e.g., logistic regression), as shown in Fig. 1 and Fig. 3, and the standard MLE learns an imbalanced prediction rule, as we have analyzed in Section 3. Given that the state-of-the-art object model is not good, it is very encouraging to see that we can still obtain positive improvements by using the closely coupled and well-balanced max-margin learning. These results indicate that further improvements can be expected by improving the local object model, e.g., by incorporating rich features.\n\nWe also compare with the theme model [9], which is for scene categorization only. The theme model uses a different image representation, where each image is a vector of image patch codewords. The theme model achieves about 0.65 in classification accuracy, lower than that of MM-Scene.\n\nTable 1: Confusion matrix for 100-topic MM-Scene on the sports dataset.\n0.717 | badminton | bocce | croquet | polo | rock climbing | rowing | sailing | snowboarding\nbadminton | 0.768 | 0.051 | 0.051 | 0.081 | 0.020 | 0.020 | 0.000 | 0.010\nbocce | 0.043 | 0.333 | 0.275 | 0.145 | 0.087 | 0.058 | 0.014 | 0.043\ncroquet | 0.025 | 0.144 | 0.669 | 0.093 | 0.025 | 0.025 | 0.008 | 0.008\npolo | 0.220 | 0.055 | 0.099 | 0.516 | 0.022 | 0.022 | 0.011 | 0.055\nrock climbing | 0.000 | 0.010 | 0.021 | 0.000 | 0.845 | 0.031 | 0.010 | 0.082\nrowing | 0.008 | 0.008 | 0.008 | 0.008 | 0.024 | 0.912 | 0.016 | 0.016\nsailing | 0.011 | 0.021 | 0.000 | 0.021 | 0.011 | 0.053 | 0.884 | 0.000\nsnowboarding | 0.011 | 0.021 | 0.032 | 0.095 | 0.084 | 0.053 | 0.063 | 0.642\n\nTable 2: The 67 indoor categories sorted by classification accuracy by 70-topic MM-Scene.\nbuffet 0.85 | lobby 0.40 | stairscase 0.25 | hospital room 0.10\ngreenhouse 0.84 | prison cell 0.39 | studio music 0.24 | kindergarden 0.10\ncloister 0.71 | casino 0.36 | children room 0.21 | laundromat 0.10\ninside bus 0.61 | dining room 0.35 | garage 0.20 | office 0.10\nmovie theater 0.60 | kitchen 0.35 | gym 0.20 | restaurant kitchen 0.09\npool inside 0.59 | wine cellar 0.34 | hair salon 0.20 | shoe shop 0.09\nchurch inside 0.56 | library 0.31 | living room 0.20 | video store 0.08\nclassroom 0.55 | tv studio 0.30 | operating room 0.20 | airport inside 0.07\nconcert hall 0.55 | warehouse 0.29 | pantry 0.20 | bar 0.06\ncorridor 0.55 | bathroom 0.26 | subway 0.20 | deli 0.06\nflorist 0.55 | bookstore 0.25 | toy store 0.19 | jewellery shop 0.06\ntrain station 0.54 | computer room 0.25 | art studio 0.14 | laboratory wet 0.05\ncloset 0.51 | dental office 0.25 | fast food restaurant 0.13 | locker room 0.05\nelevator 0.49 | grocery store 0.25 | auditorium 0.12 | museum 0.05\nnursery 0.44 | inside subway 0.25 | bakery 0.11 | restaurant 0.05\nbowling 0.41 | mall 0.25 | bedroom 0.11 | waiting room 0.04\ngame room 0.40 | meeting room 0.25 | clothing store 0.10 |\n\nFigure 4: Classification accuracy of MM-Scene with different loss functions $\\Delta\\ell_d(s)$. (Plot: x-axis, loss settings 0/1, 0/5, 0/10, 0/20, 0/30, 0/40 and 0/50; y-axis, scene classification accuracy.)\n\nFinally, we examine the influence of the loss function $\\Delta\\ell_d(s)$ on the performance of the max-margin scene model. As we can see in problem (11), the loss function $\\Delta\\ell_d(s)$ is another important factor that influences the estimation of $\\theta$ and its relative importance in the prediction rule (5). Here, we use the 0/$\\ell$-loss function, that is, $\\Delta\\ell_d(s) = \\ell$ if $s \\neq s_d$; otherwise 0. Fig. 4 shows the performance of the 100-topic MM-Scene model when using different loss functions. When $\\ell$ is set between 10 and 20, the MM-Scene method stably achieves the best performance. The above results in Fig. 3 and Table 1 are achieved with $\\ell$ selected from 5 to 40 with cross-validation during training.\n\n5.4 Scene Categorization on the 67-Class MIT Indoor Scene Dataset\n\nFigure 5: Classification accuracy on the 67-class MIT indoor dataset. (Plot: scene classification accuracy for MLE-Scene-NG, MM-Scene-NG, SVM, LR, ROI+Gist (segmentation), ROI+Gist (annotation), MLE-Scene and MM-Scene.)\n\nThe MIT indoor dataset [20] contains complex scene images from 67 categories. We use the same training and testing dataset as in [20], in which each category has about 80 images for training and about 20 images for testing. We compare the joint scene model with SVM, logistic regression (LR), and the prototype-based methods [20]. Both the SVM and LR are based on the global gist features only. For the joint scene model, we set the number of latent topics at 70. The overall performance of the different methods is shown in Fig. 5 and the classification accuracy of each class is shown in Table 2. For the prototype-based methods, we cite the results from [20]. We can see that the joint scene model (both MLE-Scene and MM-Scene) significantly outperforms SVM and LR, which use global features only. The likelihood-based MLE-Scene slightly outperforms ROI+Gist (segmentation), which uses both the global gist features and local region-of-interest (ROI) features extracted from automatically segmented regions [20]. By using max-margin training, the joint scene model (i.e., MM-Scene) achieves significant improvements compared to MLE-Scene. Moreover, the margin-based MM-Scene, which uses automatically segmented regions to extract features, outperforms the ROI+Gist (annotation) method that uses human-annotated regions of interest.\n\n6 Conclusions\n\nIn this paper, we address the weak coupling problem of the commonly used maximum likelihood estimation in learning upstream scene understanding models by presenting a joint maximum margin and maximum likelihood learning method. The proposed approach achieves a close interplay between the prediction model estimation and latent topic discovery, and thereby a well-balanced prediction rule. The optimization problem is efficiently solved with a variational EM procedure, which iteratively learns an online loss-augmented SVM. Finally, we demonstrate the advantages of max-margin training and the effectiveness of using global features in scene understanding on both an 8-category sports dataset and the 67-class MIT indoor scene data.\n\nAcknowledgements\n\nJ.Z. and E.P.X. are supported by ONR N000140910758, NSF IIS-0713379, NSF Career DBI-0546594, and an Alfred P. Sloan Research Fellowship to E.P.X. L.F.-F. is partially supported by an NSF CAREER grant (IIS-0845230), a Google research award, and a Microsoft Research Fellowship. We also would like to thank Olga Russakovsky for helpful comments.\n\nReferences\n\n[1] P. Arbel\u00e1ez and L. Cohen. Constrained image segmentation from hierarchical boundaries. In CVPR, 2008.\n[2] I. Biederman. On the semantics of a glance at a scene. Perceptual Organization, 213\u2013253, 1981.\n[3] D. Blei and J. Lafferty. Correlated topic models. In NIPS, 2006.\n[4] D. Blei and J. D. McAuliffe. Supervised topic models. In NIPS, 2007.\n[5] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, (3):993\u20131022, 2003.\n[6] L.-L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In ICCV, 2007.\n[7] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, (2):265\u2013292, 2001.\n[8] L. Du, L. Ren, D. Dunson, and L. Carin. A bayesian model for simultaneous image cluster, annotation and object segmentation. In NIPS, 2009.\n[9] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.\n[10] A. Friedman. Framing pictures: The role of knowledge in automatized encoding and memory for gist. Journal of Experimental Psychology: General, 108(3):316\u2013355, 1979.\n[11] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27\u201359, 2009.\n[12] S. Lacoste-Julien, F. Sha, and M. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, 2008.\n[13] L.-J. Li and L. Fei-Fei. What, where and who? classifying events by scene and object recognition. In CVPR, 2007.\n[14] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, 2009.\n[15] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, (45):503\u2013528, 1989.\n[16] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.\n[17] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: A graphical model relating features, objects, and scenes. In NIPS, 2003.\n[18] D. Navon. Forest before trees: The precedence of global features in visual perception. Perception and Psychophysics, 5:197\u2013200, 1969.\n[19] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145\u2013175, 2001.\n[20] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.\n[21] B. Sch\u00f6lkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.\n[22] J. Sivic, B. C. Russell, A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, 2005.\n[23] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In CVPR, 2005.\n[24] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.\n[25] C. Wang, D. Blei, and L. Fei-Fei. Simultaneous image classification and annotation. In CVPR, 2009.\n[26] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.\n[27] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models for regression and classification. In ICML, 2009.", "award": [], "sourceid": 386, "authors": [{"given_name": "Jun", "family_name": "Zhu", "institution": null}, {"given_name": "Li-jia", "family_name": "Li", "institution": null}, {"given_name": "Li", "family_name": "Fei-fei", "institution": null}, {"given_name": "Eric", "family_name": "Xing", "institution": null}]}