{"title": "Monte Carlo Methods for Maximum Margin Supervised Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1592, "page_last": 1600, "abstract": "An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihood-based supervised topic models, whose posterior inference can be carried out using Bayes' rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limiting the choice of inference algorithms to those based on variational approximation with strict mean-field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. 
We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.", "full_text": "Monte Carlo Methods for Maximum Margin Supervised Topic Models\n\nQixia Jiang\u2020\u2021, Jun Zhu\u2020\u2021, Maosong Sun\u2020, and Eric P. Xing\u2217\u2217\n\u2020Department of Computer Science & Technology, Tsinghua National TNList Lab, \u2020State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing 100084, China\n\u2217School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213\n{qixia,dcszj,sms}@mail.tsinghua.edu.cn; epxing@cs.cmu.edu\n\nAbstract\n\nAn effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihood-based supervised topic models, whose posterior inference can be carried out using Bayes\u2019 rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limiting the choice of inference algorithms to those based on variational approximation with strict mean-field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.\n\n1 Introduction\n\nTopic models, such as Latent Dirichlet Allocation (LDA) [3], have shown great promise in discovering latent semantic representations of large collections of text documents. In order to fit data better, LDA has been successfully extended in various ways. One notable extension is supervised topic models, which were developed to incorporate supervising side information for discovering predictive latent topic representations. Representative methods include supervised LDA (sLDA) [2, 1
2], discriminative LDA (DiscLDA) [8], and max-entropy discrimination LDA (MedLDA) [16]. MedLDA differs from its counterpart supervised topic models by imposing discriminative constraints (i.e., max-margin constraints) directly on the desired posterior distributions, instead of defining a normalized likelihood model as in sLDA and DiscLDA. Such topic models with max-margin posterior constraints have shown superior performance in various settings [16, 14, 13, 9]. However, their constrained formulations, especially when using soft margin constraints for inseparable practical problems, make it infeasible or at least hard, if possible at all (see footnote 1), to directly apply Monte Carlo (MC) methods [10], which have been widely used in the posterior inference of likelihood-based models, such as the collapsed Gibbs sampling methods for LDA [5]. Previous inference methods for such models with max-margin posterior constraints have relied exclusively on variational methods [7], usually with a strict mean-field assumption. Although factorized variational methods often seek faster approximation solutions, they could be inaccurate or obtain too compact results [1].\n\nFootnote \u2217: \u2021 indicates equal contributions from these authors.\nFootnote 1: Rejection sampling can be applied when the constraints are hard, e.g., for separable problems. But it would be inefficient when the sample space is large.\n\nIn this paper, we develop efficient Monte Carlo methods for max-margin supervised topic models, which we believe is crucial for highly scalable implementation and further performance enhancement of this class of models. Specifically, we first provide a new and equivalent formulation of MedLDA as a regularized Bayesian model with max-margin posterior constraints, based on Zellner\u2019s interpretation of Bayes\u2019 rule as a learning model [15] and the recent development of regularized Bayesian inference [17]. This interpretation is arguably more natural than the original formulation of MedLDA as a hybrid max-likelihood and max-margin learning, where the log-likelihood is approximated by a variational upper bound for computational tractability. Then, we deal with the set of soft max-margin constraints with convex duality methods and derive the optimal solutions of the desired posterior distribut
ions. To effectively reduce the size of the sampling space, we develop two samplers, namely, an importance sampler and a collapsed Gibbs sampler [4, 1], with a much weaker assumption on the desired posterior distribution compared to the mean-field methods in [16]. We note that the work [11] presents a duality method to handle moment matching constraints in maximum entropy models. Our work is an extension of their results to learn topic models, which have nontrivially structured latent variables and also use the general soft margin constraints.\n\n2 Latent Dirichlet Allocation\n\nLDA [3] is a hierarchical Bayesian model that posits each document as an admixture of K topics, where each topic \u03a6_k is a multinomial distribution over a V-word vocabulary. For document d, its topic proportion \u03b8_d is a multinomial distribution drawn from a Dirichlet prior. Let w_d = {w_dn}_{n=1}^{N} denote the words appearing in document d. For the n-th word w_dn, a topic assignment z_dn = k is drawn from \u03b8_d and w_dn is drawn from \u03a6_k. In short, the generative process of d is\n\n\u03b8_d \u223c Dir(\u03b1), z_dn = k \u223c Mult(\u03b8_d), w_dn \u223c Mult(\u03a6_k), (1)\n\nwhere Dir(\u00b7) is a Dirichlet distribution and Mult(\u00b7) is a multinomial distribution. For fully-Bayesian LDA, the topics are also random samples drawn from a Dirichlet prior, i.e., \u03a6_k \u223c Dir(\u03b2). Let W = {w_d}_{d=1}^{D} denote all the words in a corpus with D documents, and define z_d = {z_dn}_{n=1}^{N}, Z = {z_d}_{d=1}^{D}, \u0398 = {\u03b8_d}_{d=1}^{D}. The goal of LDA is to infer the posterior distribution\n\np(\u0398, Z, \u03a6 | W, \u03b1, \u03b2) = p_0(\u0398, Z, \u03a6 | \u03b1, \u03b2) p(W | \u0398, Z, \u03a6) / p(W | \u03b1, \u03b2). (2)\n\nSince inferring the true posterior distribution is intractable, researchers must resort to variational [3] or Monte Carlo [5] approximate methods. Both methods have shown success in various scenarios, and they have complementary advantages. For example, variational methods (e.g., mean-field) can be generally more efficient, while MC methods can obtain more accurate estimates.\n\n3 MedLDA: a supervised topic model with max-margin constraints\n\nMedLDA extends LDA by integrating max-margin learning into the procedure of discovering latent topic representations, so as to learn latent representations that are good for predicting class labels or rating scores of a document. Empirically, M
edLDA and its various extensions [14, 13, 9] have demonstrated promise in learning more discriminative topic representations. The original MedLDA was designed as a hybrid max-likelihood and max-margin learning, where the intractable log-likelihood is approximated by a variational bound. To derive our sampling methods, we present a new interpretation of MedLDA from the perspective of regularized Bayesian inference [17].\n\n3.1 Bayesian inference as a learning model\n\nAs shown in Eq. (2), Bayesian inference is an information processing rule that projects the prior p_0 and empirical evidence to a post-data posterior distribution via Bayes\u2019 rule. It is the core for likelihood-based supervised topic models [2, 12]. A fresh interpretation of Bayesian inference was given by Zellner [15], which leads to our novel interpretation of MedLDA. Specifically, Zellner showed that the posterior distribution by Bayes\u2019 rule is the solution of an optimization problem. For instance, the posterior p(\u0398, Z, \u03a6 | W) of LDA is equivalent to the optimum solution of\n\nmin_{p(\u0398, Z, \u03a6) \u2208 P} KL[p(\u0398, Z, \u03a6) \u2225 p_0(\u0398, Z, \u03a6)] \u2212 E_p[log p(W | \u0398, Z, \u03a6)], (3)\n\nwhere KL(q \u2225 p) is the Kullback-Leibler divergence from q to p, and P is the space of probability distributions. We will use L(p(\u0398, Z, \u03a6)) to denote the objective function.\n\n3.2 MedLDA: a regularized Bayesian model\n\nFor brevity, we consider the classification model. Let D = {(w_d, y_d)}_{d=1}^{D} be a given fully-labeled training set, where the response variable Y takes values from a finite set Y = {1, ..., M}. MedLDA consists of two parts. The first part is an LDA likelihood model for describing input documents. As in previous work, we use the partial likelihood model (see footnote 2) for W. The second part is a mechanism to consider the supervising signal. Since our goal is to discover latent representations Z that are good for classification, one natural solution is to connect Z directly to our ultimate goal. MedLDA obtains such a goal by building a classification model on Z. One good candidate of the classification model is the max-margin methods, which avoid defining a normalized likelihood model [12]. Formally, let \u03b7 denote the parameters of the classification model. To make the model fully-Bayesian, we also tre
at \u03b7 as random. Then, we want to infer the joint posterior distribution p(\u03b7, \u0398, Z, \u03a6 | D). For classification, MedLDA defines the following discrimination function\n\nF(y, \u03b7, z; w) = \u03b7^\u22a4 f(y, \u00afz), F(y; w) = E_{p(\u03b7, z | w)}[F(y, \u03b7, z; w)], (4)\n\nwhere \u00afz is a K-dim vector whose element \u00afz_k equals (1/N) \u2211_{n=1}^{N} I(z_n = k), and I(x) is an indicator function which equals 1 when x is true and 0 otherwise; f(y, \u00afz) is an MK-dim vector whose elements from (y \u2212 1)K + 1 to yK are \u00afz and all others are zero; and \u03b7 is an MK-dimensional vector concatenating M class-specific sub-vectors. With the above definitions, a natural prediction rule is\n\n\u02c6y = argmax_y F(y; w), (5)\n\nand we would like to \u201cregularize\u201d the properties of the latent topic representations to make them suitable for a classification task. One way to achieve that goal is to take the optimization view of Bayes\u2019 theorem and impose the following max-margin constraints on problem (3)\n\nF(y_d; w_d) \u2212 F(y; w_d) \u2265 \u2113_d(y) \u2212 \u03be_d, \u2200y \u2208 Y, \u2200d, (6)\n\nwhere \u2113_d(y) is a non-negative function that penalizes the wrong predictions; \u03be = {\u03be_d}_{d=1}^{D} are non-negative slack variables for inseparable cases. Let L(p) = KL(p \u2225 p_0(\u03b7, \u0398, Z, \u03a6)) \u2212 E_p[log p(W | Z, \u03a6)] and \u2206f(y, \u00afz_d) = f(y_d, \u00afz_d) \u2212 f(y, \u00afz_d). Then, we define the soft-margin MedLDA as solving\n\nmin_{p(\u03b7, \u0398, Z, \u03a6) \u2208 P, \u03be} L(p(\u03b7, \u0398, Z, \u03a6)) + (C/D) \u2211_{d=1}^{D} \u03be_d s.t.: E_p[\u03b7^\u22a4 \u2206f(y, \u00afz_d)] \u2265 \u2113_d(y) \u2212 \u03be_d, \u03be_d \u2265 0, \u2200d, \u2200y, (7)\n\nwhere the prior is p_0(\u03b7, \u0398, Z, \u03a6) = p_0(\u03b7) p_0(\u0398, Z, \u03a6). With the above discussions, we can see that MedLDA is an instance of regularized Bayesian models [17]. Also, problem (7) can be equivalently written as\n\nmin_{p(\u03b7, \u0398, Z, \u03a6) \u2208 P} L(p(\u03b7, \u0398, Z, \u03a6)) + C R(p(\u03b7, \u0398, Z, \u03a6)) (8)\n\nwhere R = (1/D) \u2211_d max_y (\u2113_d(y) \u2212 E_p[\u03b7^\u22a4 \u2206f(y, \u00afz_d)]) is the hinge loss, an upper bound of the prediction error on training data.\n\n4 Monte Carlo methods for MedLDA\n\nAs in other variants of topic models, it is intractable to solve problem (7) or the equivalent pr
oblem (8) directly. Previous solutions resort to variational mean-field approximation methods. It is easy to show that the variational EM method in [16] is a coordinate descent algorithm to solve problem (7), with the additional fully-factorized mean-field constraint,\n\np(\u03b7, \u0398, Z, \u03a6) = p(\u03b7) (\u220f_d p(\u03b8_d) \u220f_n p(z_dn)) \u220f_k p(\u03a6_k). (9)\n\nBelow, we present two MC sampling methods to solve the MedLDA problem, with much weaker constraints on p, and thus they could be expected to produce more accurate solutions. Specifically, we assume p(\u03b7, \u0398, Z, \u03a6) = p(\u03b7) p(\u0398, Z, \u03a6). Then, the general procedure is to alternately solve problem (8) by performing the following two steps.\n\nFootnote 2: A full likelihood model on both W and Y can be defined as in [12]. But its normalization constant (a function of Z) could make the problem hard to solve.\n\nEstimate p(\u03b7): Given p(\u0398, Z, \u03a6), the subproblem (in an equivalent constrained form) is to solve\n\nmin_{p(\u03b7), \u03be} KL(p(\u03b7) \u2225 p_0(\u03b7)) + (C/D) \u2211_{d=1}^{D} \u03be_d s.t.: E_p[\u03b7]^\u22a4 \u2206f(y, E[\u00afz_d]) \u2265 \u2113_d(y) \u2212 \u03be_d, \u03be_d \u2265 0, \u2200d, \u2200y. (10)\n\nBy using the Lagrangian methods with multipliers \u03bb, we have the optimum posterior distribution\n\np(\u03b7) \u221d p_0(\u03b7) exp(\u03b7^\u22a4 \u00b7 \u2211_{d=1}^{D} \u2211_y \u03bb_d^y \u2206f(y, E[\u00afz_d])). (11)\n\nFor the prior p_0, for simplicity, we choose the standard normal prior, i.e., p_0(\u03b7) = N(0, I). In this case, p(\u03b7) = N(\u03ba, I) and the dual problem is\n\nmax_\u03bb \u2212(1/2) \u03ba^\u22a4 \u03ba + \u2211_{d=1}^{D} \u2211_y \u03bb_d^y \u2113_d(y) s.t.: \u2211_y \u03bb_d^y \u2208 [0, C/D], \u2200d, (12)\n\nwhere \u03ba = \u2211_{d=1}^{D} \u2211_y \u03bb_d^y \u2206f(y, E[\u00afz_d]). Note that \u03ba is the posterior mean of classifier parameters \u03b7, and the element \u03ba_yk represents the contribution of topic k in classifying a data point to category y. This problem is the dual problem of a multi-class SVM [6] and we can solve it (or its primal form) efficiently using existing high-performance SVM learners. We denote the optimum solution of this problem by (p^\u2217(\u03b7), \u03ba^\u2217, \u03be^\u2217, \u03bb^\u2217).\n\nEstimate p(\u0398, Z, \u03a6): Given p(\u03b7), the subproblem (in an equivalent constraine
d form) is to solve\n\nmin_{p(\u0398, Z, \u03a6), \u03be} L(p(\u0398, Z, \u03a6)) + (C/D) \u2211_{d=1}^{D} \u03be_d s.t.: (\u03ba^\u2217)^\u22a4 \u2206f(y, E_p[\u00afz_d]) \u2265 \u2113_d(y) \u2212 \u03be_d, \u03be_d \u2265 0, \u2200d, \u2200y. (13)\n\nAlthough in theory we can solve this subproblem again using Lagrangian dual methods, it would be hard to derive the dual objective function (if possible at all). Here, we use the same strategy as in [16], that is, to update p(\u0398, Z, \u03a6) for only one step with \u03be being fixed at \u03be^\u2217 (the optimum solution of the previous step). It is easy to show that by fixing \u03be at \u03be^\u2217, we will have the optimum solution\n\np(\u0398, Z, \u03a6) \u221d p(W, Z, \u0398, \u03a6) exp((\u03ba^\u2217)^\u22a4 \u2211_d \u2211_y (\u03bb_d^y)^\u2217 \u2206f(y, \u00afz_d)). (14)\n\nThe differences between MedLDA and LDA lie in the above posterior distribution. The first term is the same as the posterior of LDA (the evidence p(W) can be absorbed into the normalization constant). The second term indicates the regularization effects due to the max-margin posterior constraints, which is consistent with our intuition. Specifically, for those data with non-zero Lagrange multipliers (i.e., the data are around the decision boundary or misclassified), the second term will bias the model towards a new posterior distribution that favors more discriminative representations on these \u201chard\u201d data points. Now, the remaining problem is how to efficiently draw samples from p(\u0398, Z, \u03a6) and estimate the expectations E[\u00afz] as accurately as possible, which are needed in learning classification models. Below, we present two representative samplers: an importance sampler and a collapsed Gibbs sampler.\n\n4.1 Importance sampler\n\nTo avoid dealing with the intractable normalization constant of p(\u0398, Z, \u03a6), one natural choice is to use importance sampling. Importance sampling aims at drawing some samples from a \u201csimple\u201d distribution, and the expectation is estimated as a weighted average over these samples. However, directly applying importance sampling to p(\u0398, Z, \u03a6) may cause some issues since importance sampling suffers from severe limitations in large sample spaces. Alternatively, since the distribution p(\u0398, Z, \u03a6) in Eq. (14) has t
he factorization form p(\u0398, Z, \u03a6) = p_0(\u0398, \u03a6) p(Z | \u0398, \u03a6), another possible method is to adopt the ancestral sampling strategy to draw a sample (\u02c6\u0398, \u02c6\u03a6) from p_0(\u0398, \u03a6) and then draw samples from p(Z | \u02c6\u0398, \u02c6\u03a6). Although it is easy to draw a sample from the Dirichlet prior p_0(\u0398, \u03a6) = Dir(\u03b1) Dir(\u03b2), it would require a large number of samples to get a robust estimate of the expectations E[Z]. Below, we present one solution to reduce the sample space. One feasible method to reduce the sample space is to collapse (\u0398, \u03a6) out and directly draw samples from the marginal distribution p(Z). However, this will cause tight couplings between Z and make the number of samples needed to estimate the expectation grow exponentially with the dimensionality of Z for the importance sampler. A practical sampler for this collapsed distribution would be a Markov chain, as we will present in the next section. Here, we propose to use the MAP estimate of (\u0398, \u03a6) as their \u201csingle sample\u201d (see footnote 3) and proceed to draw samples of Z. Specifically, given (\u02c6\u0398, \u02c6\u03a6), we have the conditional distribution\n\np(Z | \u02c6\u0398, \u02c6\u03a6) \u221d p(W, Z | \u02c6\u0398, \u02c6\u03a6) exp((\u03ba^\u2217)^\u22a4 \u2211_d \u2211_y (\u03bb_d^y)^\u2217 \u2206f(y, \u00afz_d)) = \u220f_{d=1}^{D} \u220f_{n=1}^{N_d} p(z_dn | \u02c6\u03b8_d, \u02c6\u03a6), (15)\n\nwhere\n\np(z_dn = k | \u02c6\u03b8_d, \u02c6\u03a6, w_dn = t) = (1/Z_dn) \u02c6\u03d5_kt \u02c6\u03b8_dk exp((1/N_d) \u2211_y (\u03bb_d^y)^\u2217 (\u03ba^\u2217_{y_d k} \u2212 \u03ba^\u2217_{yk})) (16)\n\nand Z_dn is a normalization constant, and \u03ba^\u2217_{yk} is the [(y \u2212 1)K + k]-th element of \u03ba^\u2217. The difference (\u03ba^\u2217_{y_d k} \u2212 \u03ba^\u2217_{yk}) represents the different contribution of topic k in classifying d to the true category y_d and a wrong category y. If the difference is positive, topic k contributes to making a correct prediction for d; otherwise, it contributes to making a wrong prediction. Then, we draw J samples {z_dn^(j)}_{j=1}^{J} from a proposal distribution g(z) and compute the expectations\n\nE[\u00afz_dk] = (1/N_d) \u2211_{n=1}^{N_d} E[z_dn], \u2200\u00afz_dk \u2208 \u00afz_d, and E[z_dn] \u2248 \u2211_{j=1}^{J} (\u03b3_dn^j / \u2211_{j=1}^{J} \u03b3_dn^j) z_dn^(j), (17)\n\nwhere the importance weight \u03b3_dn^j is\n\n\u03b3_dn^j = \u220f_{k=1}^{K} (\u02c6\
u03b8_dk \u02c6\u03d5_{k w_dn} / g(k) \u00b7 exp((1/N_d) \u2211_y (\u03bb_d^y)^\u2217 (\u03ba^\u2217_{y_d k} \u2212 \u03ba^\u2217_{yk})))^{I(z_dn^(j) = k)}. (18)\n\nWith the J samples, we update the MAP estimate (\u02c6\u0398, \u02c6\u03a6)\n\n\u02c6\u03b8_dk \u221d (1/J) \u2211_{n=1}^{N_d} \u2211_{j=1}^{J} (\u03b3_dn^j / \u2211_{j=1}^{J} \u03b3_dn^j) I(z_dn^(j) = k) + \u03b1_k,\n\u02c6\u03d5_kt \u221d (1/J) \u2211_{d=1}^{D} \u2211_{n=1}^{N_d} \u2211_{j=1}^{J} (\u03b3_dn^j / \u2211_{j=1}^{J} \u03b3_dn^j) I(z_dn^(j) = k) I(w_dn = t) + \u03b2_t. (19)\n\nThe above two steps are repeated until convergence, initializing (\u02c6\u0398, \u02c6\u03a6) to be uniform, and the samples from the last iteration are used to estimate the expectation statistics needed in the problem of inferring p(\u03b7).\n\n4.2 Collapsed Gibbs sampler\n\nAs we have stated, another way to effectively reduce the sample space is to integrate out the intermediate variables (\u0398, \u03a6) and build a Markov chain whose equilibrium distribution is the resulting marginal distribution p(Z). We propose to use collapsed Gibbs sampling, which has been successfully used for LDA [5]. For MedLDA, we integrate out (\u0398, \u03a6) and get the marginalized posterior distribution\n\np(Z) = (p(W, Z | \u03b1, \u03b2) / Z_q) exp((\u03ba^\u2217)^\u22a4 \u2211_d \u2211_y (\u03bb_d^y)^\u2217 \u2206f(y, \u00afz_d)) = (1/Z) [\u220f_{d=1}^{D} (\u03b4(C_d + \u03b1) / \u03b4(\u03b1)) exp((\u03ba^\u2217)^\u22a4 \u2211_y (\u03bb_d^y)^\u2217 \u2206f(y, \u00afz_d))] [\u220f_{k=1}^{K} \u03b4(C_k + \u03b2) / \u03b4(\u03b2)], (20)\n\nwhere \u03b4(x) = \u220f_{i=1}^{dim(x)} \u0393(x_i) / \u0393(\u2211_{i=1}^{dim(x)} x_i), C_k^t is the number of times the term t is assigned to topic k over the whole corpus and C_k = {C_k^t}_{t=1}^{V}; C_d^k is the number of times that terms are associated with topic k within the d-th document and C_d = {C_d^k}_{k=1}^{K}. We can also derive the transition probability of one variable z_dn given the others, which we denote by Z_\u00ac, as:\n\np(z_dn = k | Z_\u00ac, W_\u00ac, w_dn = t) \u221d ((C_{k,\u00acn}^t + \u03b2_t) / (\u2211_t C_{k,\u00acn}^t + \u2211_{t=1}^{V} \u03b2_t)) (C_{d,\u00acn}^k + \u03b1_k) exp((1/N_d) \u2211_y (\u03bb_d^y)^\u2217 (\u03ba^\u2217_{y_d k} \u2212 \u03ba^\u2217_{yk})) (21)\n\nwhere C_{\u00b7,\u00acn}^\u00b7 indicates that term n is excluded from the corresponding document or topic. Again, we can see the difference between MedLDA and LDA (using collapsed Gibbs sampling) from the additional last term in Eq. (21), which is due to the max-margin posterior constraints.\n\nFootnote 3: This collapses the samples
pace of (\u0398, \u03a6) to a single point.\n\nFor those data on the margin or misclassified (with non-zero Lagrange multipliers), the last term is non-zero and acts as a regularizer directly affecting the topic assignments of these difficult data. Then, we use the transition distribution in Eq. (21) to construct a Markov chain. After this Markov chain has converged (i.e., finished the burn-in stage), we draw J samples {Z^(j)} and estimate the expectation statistics\n\nE[\u00afz_dk] = (1/N_d) \u2211_{n=1}^{N_d} E[z_dn], \u2200\u00afz_dk \u2208 \u00afz_d, and E[z_dn] = (1/J) \u2211_{j=1}^{J} z_dn^(j). (22)\n\n4.3 Prediction\n\nTo make predictions on unlabeled testing data using the prediction rule (5), we take the approach that has been adopted for the variational MedLDA, which uses a point estimate of topics \u03a6 from training data and makes predictions based on them. Specifically, we use the MAP estimate \u02c6\u03a6 to replace the probability distribution p(\u03a6). For the importance sampler, \u02c6\u03a6 is computed as in Eq. (19). For the collapsed Gibbs sampler, an estimate of \u02c6\u03a6 using the samples is \u02c6\u03d5_kt \u221d (1/J) \u2211_{j=1}^{J} C_k^{t(j)} + \u03b2_t, where C_k^{t(j)} is the number of times that term t is assigned to topic k in the j-th sample. Given a new document w to be predicted, for the importance sampler, the importance weight should be altered as \u03b3_n^j = \u220f_{k=1}^{K} (\u03b8_k \u02c6\u03d5_{k w_n} / g(k))^{I(z_n^(j) = k)}. Then, we approximate the expectation of z as in Eq. (17). For the Gibbs sampler, we infer its latent components z using the obtained \u02c6\u03a6 as p(z_n = k | z_\u00acn) \u221d \u02c6\u03d5_{k w_n} (C_\u00acn^k + \u03b1_k), where C_\u00acn^k is the number of times that the terms in this document w are assigned to topic k with the n-th term excluded. Then, we approximate E[\u00afz] as in Eq. (22).\n\n5 Experiments\n\nWe empirically evaluate the importance sampler and the Gibbs sampler for MedLDA (denoted by iMedLDA and gMedLDA respectively) on the 20 Newsgroups dataset with a standard list of stop words (see footnote 4) removed. This dataset contains about 20K postings within 20 groups. Due to space limitation, we focus on the multi-class setting. We use the cutting-plane algorithm [6] to solve the multi-class SVM to infer p(\u03b7) and solve for the Lagrange multipliers \u03bb in MedLDA. For simplicity, we use the uniform proposal distribution g in iMedLDA. In this case, we can globa
lly draw J (e.g., = 3 \u00d7 K) samples {Z^(j)}_{j=1}^{J} from g(z) outside the iteration loop and only update the importance weights to save time. For gMedLDA, we keep J (e.g., 20) adjacent samples after gMedLDA has converged to estimate the expectation statistics. To be fair, we use the same C for different MedLDA methods. The optimum C is chosen via 5-fold cross validation during the training procedure of fMedLDA from {a^2 : a = 1, ..., 8}. We use symmetric Dirichlet priors for all LDA topic models, i.e., \u03b1 = \u03b1 e_K and \u03b2 = \u03b2 e_V, where e_n is an n-dim vector with every entry being 1. We consider a Markov chain converged when (1) it has run for a maximum number of iterations (e.g., 100), or (2) the relative change in its objective, i.e., |L_{t+1} \u2212 L_t| / L_t, is less than a tolerance threshold \u03f5 (e.g., \u03f5 = 10^\u22124). We use the same strategy to judge whether the overall inference algorithm converges. We randomly select 7,505 documents from the whole set as the test set and the rest as the training data. We set the cost parameter \u2113_d(y) in problem (7) to be 16, which produces better classification performance than the standard 0/1 cost [16]. To measure the sparsity of the latent representations of documents, we compute the average entropy over test documents: (1/|D_t|) \u2211_{d \u2208 D_t} H(\u03b8_d). We also measure the sparsity of the inferred topic distributions \u03a6 in terms of the average entropy over topics, i.e., (1/K) \u2211_{k=1}^{K} H(\u03a6_k). All experiments are carried out on a PC with 2.2GHz CPU and 3.6G RAM. We report the mean and standard deviation for each model over 4 randomly initialized runs.\n\nFootnote 4: http://mallet.cs.umass.edu/\n\n5.1 Performance with different topic numbers\n\nThis section compares gMedLDA and iMedLDA with baseline methods. MedLDA was shown to outperform sLDA for document classification. Here, we focus on comparing the performance of MedLDA and LDA when using different inference algorithms. Specifically, we compare with the LDA model that uses collapsed Gibbs sampling [5] (denoted by gLDA) and the LDA model that uses fully-factorized variational methods [3] (denoted by fLDA). For LDA models, we discover the latent representations of the training documents and use them to build a multi-class SVM classifier. For MedLDA, we report the results when using fully-factorized variational methods (denoted by fMedLDA) as in [16]. Furthermore, fMedLDA and fLDA optimize the hyper-parameter \u03b1 using the Newton-Raphson method [3], while gMedLDA, iMedLDA and gLDA determine \u03b1 by 5-fold cross-validation. We have tested a wide range of values of \u03b2 (e.g., 10^\u221216 \u223c 10^3) and found that the performance of iMedLDA degrades seriously when \u03b2 is larger than 10^\u22123. Therefore, we set \u03b2 to be 10^\u22125 for iMedLDA and 0.01 for the other topic models, just as in the literature [5].\n\nFigure 1: Performance of multi-class classification of different topic models with different topic numbers on the 20 Newsgroups dataset: (a) classification accuracy, (b) the average entropy of \u0398 over test documents, and (c) the average entropy of topic distributions \u03a6. Each panel plots iMedLDA, gMedLDA, fMedLDA, gLDA and fLDA against the number of topics.\n\nFig. 1(a) shows the accuracy. We can see that Monte Carlo methods generally outperform the fully-factorized mean-field methods, mainly because of their weaker factorization assumptions. The superior performance of iMedLDA over gMedLDA is probably because iMedLDA is more effective in dealing with sample sparsity issues. More insights will be provided in Section 5.2. Fig. 1(b) shows the average entropy of latent representations \u0398 over test documents. We find that the entropies of gMedLDA and iMedLDA are smaller than those of gLDA and fLDA, especially for (relatively) large K. This implies that sampling methods for MedLDA can effectively concentrate the probability mass on just several topics and thus discover more predictive topic representations. However, fMedLDA yields the smallest entropy, which is mainly because the fully-factorized variational methods tend to get too compact results, e.g., sparse local optima. Fig. 1(c) shows the average entropy of topic distributions \u03a6 over topics. We can see that gMedLDA improves the sparsity of \u03a6 over fMedLDA. However, gMedLDA\u2019s entropy is l
arger than gLDA\u2019s. This is because for those \u201chard\u201d documents, the exponential component in Eq. (21) \u201cregularizes\u201d the conditional probability p(z_dn | Z_\u00ac) and leads to a smoother estimate of \u03a6. On the other hand, we find that iMedLDA has the largest entropy. This is probably because many of the samples (topic assignments) generated by the proposal distribution are \u201cincorrect\u201d but the importance sampler still assigns weights to these samples. As a result, the inferred topic distributions are very dense and thus have a large entropy. Moreover, in the above experiments, we found that the Lagrange multipliers in MedLDA are very sparse (about 1% non-zeros for both iMedLDA and gMedLDA; about 1.5% for fMedLDA), much sparser than those of an SVM built on raw input data (about 8% non-zeros).\n\n5.2 Sensitivity analysis with respect to key parameters\n\nSensitivity to \u03b1. Fig. 2(a) shows the classification performance of gMedLDA and iMedLDA with different values of \u03b1. We can see that the performance of gMedLDA increases as \u03b1 becomes large and remains stable when \u03b1 is larger than 0.1. In contrast, the accuracy of iMedLDA decreases a bit (especially for small K) when \u03b1 becomes large, but is relatively stable when \u03b1 is small (e.g., \u2264 0.01). This is probably because with a finite number of samples, the Gibbs sampler tends to produce a too sparse estimate of E[Z], and a slightly stronger prior is helpful to deal with the sample sparsity issue. In contrast, the importance sampler avoids such a sparsity issue by using a uniform proposal distribution, which could make the samples cover all topic dimensions well. Thus, a small prior is sufficient to get good performance, and increasing the prior\u2019s strength could potentially hurt.\n\nSensitivity to sample size J. For sampling methods, we always need to decide how many samples (sample size J) to keep to ensure sufficient statistical power. Fig. 2(b) shows the classification accuracy of both gMedLDA and iMedLDA with different sample size J when \u03b1 = 10^\u22122/K and C = 16. For gMedLDA, we have tested different values of J for training and prediction. We found that the sample size in the training process has almost no influence on the prediction accuracy even when it equals 1. Hence, for efficiency, we set J to be 1 during the training. It shows that gMedLDA is relatively stable when J is larger than about 20 at prediction. For iMedLDA, Fig. 2(b) shows that it becomes stable when the prediction sample size J is larger than 3 \u00d7 K.\n\nFigure 2: Sensitivity study of iMedLDA and gMedLDA: (a) classification accuracy with different \u03b1 for different topic numbers, (b) classification accuracy with different sample size J, (c) classification accuracy with different convergence criterion \u03f5 for gMedLDA, and (d) classification accuracy of different methods as a function of iterations when the topic number is 30.\n\nSensitivity to convergence criterion \u03f5. For gMedLDA, we have to judge whether a Markov chain has reached its stationarity. Relative change in the objective is a commonly used diagnostic to justify the convergence. We study the influence of \u03f5. In this experiment, we do not bound the maximum number of iterations and allow the Gibbs sampler to run until the tolerance \u03f5 is reached. Fig. 2(c) shows the accuracy of gMedLDA with different values of \u03f5. We can see that gMedLDA is relatively insensitive to \u03f5. This is mainly because gMedLDA alternately updates the posterior distribution and the Lagrange multipliers. Thus, it performs Gibbs sampling many times, which compensates for the influence of each Markov chain not having reached its stationarity. On the other hand, small \u03f5 values can greatly slow the convergence. For instance, when the topic number is 90, gMedLDA takes 11,986 seconds on training when \u03f5 = 10^\u22124 but 1,795 seconds when \u03f5 = 10^\u22122. These results imply that we can loosen the convergence criterion to speed up training while still obtaining a good model.\n\nSensitivity to iteration. Fig. 2(d) shows the classification accuracy of MedLDA with various inference
 methods as a function of iteration when the topic number is set at 30. We can see that all the various MedLDA models converge quite quickly to get good accuracy. Compared to fMedLDA, which uses mean-field variational inference, the two MedLDA models using Monte Carlo methods (i.e., iMedLDA and gMedLDA) are slightly faster to get stable prediction performance.\n\n5.3 Time efficiency\n\nFigure 3: Training time (CPU seconds against the number of topics for iMedLDA, gMedLDA, fMedLDA, gLDA and fLDA).\n\nAlthough gMedLDA can get good results even for a loose convergence criterion \u03f5 as discussed in Sec. 5.2, we set \u03f5 to be 10^\u22124 for all the methods in order to get a more objective comparison. Fig. 3 reports the total training time of different models, which includes two phases: inferring the latent topic representations and training SVMs. We find iMedLDA is the most efficient, which benefits from (1) generating samples outside the iteration loop and using them for all iterations; and (2) using the MAP estimates to collapse the sample space of (\u0398, \u03a6) to a \u201csingle sample\u201d for efficiency. In contrast, both gMedLDA and fMedLDA have to iteratively update the variables or variational parameters. gMedLDA requires more time than fMedLDA but is comparable when \u03f5 is set to be 0.01. By using the equivalent 1-slack formulation, about 76% of the training time is spent on inference for iMedLDA and 90% for gMedLDA. For prediction, both iMedLDA and gMedLDA are slightly slower than fMedLDA.\n\n6 Conclusions\n\nWe have presented two Monte Carlo methods for MedLDA, a supervised topic model using max-margin constraints directly on the desired posterior distributions for discovering predictive latent topic representations. Our methods are based on a novel interpretation of MedLDA as a regularized Bayesian model and a convex dual formulation to deal with soft-margin constraints. Experimental results on the 20 Newsgroups dataset show that Monte Carlo methods are robust to hyper-parameters and could yield very competitive results for such max-margin topic models.\n\nAcknowledgements\n\nPart of the work was done when QJ was visiting CMU. JZ and MS are supported by the National Basic Research Program of China (No. 2013CB329403 and 2012CB316301), National Natural Science Foundation of China (No
. 91120011, 61273023 and 61170196) and Tsinghua Initiative Scientific Research Program No. 20121088071. EX is supported by AFOSR FA95501010247, ONR N000140910758, NSF Career DBI-0546594 and Alfred P. Sloan Research Fellowship.\n\nReferences\n\n[1] C. M. Bishop. Pattern recognition and machine learning, volume 4. Springer New York, 2006.\n[2] D. M. Blei and J. D. McAuliffe. Supervised topic models. NIPS, pages 121\u2013128, 2007.\n[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n[4] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Boca Raton, FL: Chapman and Hall/CRC, 2004.\n[5] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. of National Academy of Sci., pages 5228\u20135235, 2004.\n[6] T. Joachims, T. Finley, and C. N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27\u201359, 2009.\n[7] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183\u2013233, 1999.\n[8] S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. NIPS, pages 897\u2013904, 2009.\n[9] D. Li, S. Somasundaran, and A. Chakraborty. A combination of topic models with max-margin learning for relation detection. In ACL TextGraphs-6 Workshop, 2011.\n[10] R. Y. Rubinstein and D. P. Kroese. Simulation and the Monte Carlo method, volume 707. Wiley-Interscience, 2008.\n[11] E. Schofield. Fitting maximum-entropy models on large sample spaces. PhD thesis, Department of Computing, Imperial College London, 2006.\n[12] C. Wang, D. M. Blei, and Li F. F. Simultaneous image classification and annotation. CVPR, 2009.\n[13] Y. Wang and G. Mori. Max-margin latent Dirichlet allocation for image classification and annotation. In BMVC, 2011.\n[14] S. Yang, J. Bian, and H. Zha. Hybrid generative/discriminative learning for automatic image annotation. In UAI, 2010.\n[15] A. Zellner. Optimal information processing and Bayes\u2019s theorem. American Statistician, pages 278\u2013280, 1988.\n[16] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. In ICML, pages 1257\u20131264, 2009.\n[17] J. Zhu
, N. Chen, and E. P. Xing. Infinite latent SVM for classification and multi-task learning. In NIPS, 2011.", "award": [], "sourceid": 751, "authors": [{"given_name": "Qixia", "family_name": "Jiang", "institution": null}, {"given_name": "Jun", "family_name": "Zhu", "institution": null}, {"given_name": "Maosong", "family_name": "Sun", "institution": null}, {"given_name": "Eric", "family_name": "Xing", "institution": null}]}