{"title": "Multilabel reductions: what is my loss optimising?", "book": "Advances in Neural Information Processing Systems", "page_first": 10600, "page_last": 10611, "abstract": "Multilabel classification is a challenging problem arising in applications ranging from information retrieval to image tagging. A popular approach to this problem is to employ a reduction to a suitable series of binary or multiclass problems (e.g., computing a softmax based cross-entropy over the relevant labels). While such methods have seen empirical success, less is understood about how well they approximate two fundamental performance measures: precision@$k$ and recall@$k$. In this paper, we study five commonly used reductions, including the one-versus-all reduction, a reduction to multiclass classification, and normalised versions of the same, wherein the contribution of each instance is normalised by the number of relevant labels. Our main result is a formal justification of each reduction: we explicate their underlying risks, and show they are each consistent with respect to either precision or recall. Further, we show that in general no reduction can be optimal for both measures. 
We empirically validate our results, demonstrating scenarios where normalised reductions yield recall gains over unnormalised counterparts.", "full_text": "Multilabel reductions: what is my loss optimising?

Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar
Google Research, New York, NY 10011
{adityakmenon,sashank,ankitsrawat,sanjivk}@google.com

1 Introduction

Multilabel classification is the problem of predicting multiple labels for a given instance [Tsoumakas and Katakis, 2007, Zhang and Zhou, 2014]. For example, in information retrieval, one may predict whether a number of documents are relevant to a given query [Manning et al., 2008]; in image tagging, one may predict whether several individuals' faces are contained in an image [Xiao et al., 2010]. In the extreme classification scenario where the number of potential labels is large, naïve modelling of all possible label combinations is prohibitive. This has motivated a number of algorithms targeting this setting [Agrawal et al., 2013, Yu et al., 2014, Bhatia et al., 2015, Jain et al., 2016, Babbar and Schölkopf, 2017, Yen et al., 2017, Prabhu et al., 2018, Jain et al., 2019, Reddi et al., 2019]. One popular strategy is to employ a reduction to a series of binary or multiclass problems, and thereafter treat labels independently. Such reductions significantly reduce the complexity of learning, and have seen empirical success. For example, in the one-versus-all reduction, one reduces the problem to a series of independent binary classification tasks [Brinker et al., 2006, Dembczyński et al., 2010, 2012]. Similarly, in the pick-one-label reduction, one reduces the problem to multiclass classification with a randomly drawn positive label [Boutell et al., 2004, Jernite et al., 2017, Joulin et al., 2017].

The theoretical aspects of these reductions are less clear, however. In particular, precisely what properties of the original multilabel problem do these reductions preserve? While similar questions are well-studied for reductions of multiclass to binary classification [Zhang, 2004, Tewari and Bartlett, 2007, Ramaswamy et al., 2014], reductions for multilabel problems have received less attention, despite their wide use in practice. Recently, Wydmuch et al. [2018] established that the pick-one-label reduction is inconsistent with respect to the precision@k, a key measure of retrieval performance. On the other hand, they showed that a probabilistic label tree-based implementation of one-versus-all is consistent. This intriguing observation raises two natural questions: what can be said about the consistency of other reductions? And does the picture change if we consider a different metric, such as the recall@k?

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Reduction                     Notation   Bayes-optimal f*_i(x)   Prec@k   Rec@k   Consistency analysis
One-versus-all                ℓ_OVA      P(y_i = 1|x)            ✓        ✗       [Wydmuch et al., 2018]
Pick-all-labels               ℓ_PAL      P(y_i = 1|x)/N(x)       ✓        ✗       This paper
One-versus-all normalised     ℓ_OVA-N    P(y'_i = 1|x)           ✗        ✓       This paper
Pick-all-labels normalised    ℓ_PAL-N    P(y'_i = 1|x)           ✗        ✓       This paper
Pick-one-label                ℓ_POL      P(y'_i = 1|x)           ✗        ✓       This paper

Table 1: Summary of reductions of multilabel to binary or multiclass classification studied in §4.1 of this paper. For each reduction, we specify the notation used for the loss function; the Bayes-optimal scorer for the ith label, assuming random (instance, label) pairs (x, y); whether the reduction is consistent for the multilabel precision@k and recall@k; and where this analysis is provided. Here, P(y_i = 1|x) denotes the marginal probability of the ith label; P(y'_i = 1|x) denotes a nonlinear transformation of this probability, per (5); and N(x) is the expected number of relevant labels for x, per (12).

In this paper, we provide a systematic study of these questions by investigating the consistency of five multilabel reductions with respect to both precision@k and recall@k:
(a) the one-versus-all reduction (OVA) to independent binary problems [Dembczyński et al., 2010];
(b) a pick-all-labels (PAL) multiclass reduction, wherein a separate multiclass example is created for each positive label [Reddi et al., 2019];
(c) a normalised one-versus-all reduction (OVA-N), where the contribution of each instance is normalised by the number of relevant labels;
(d) a normalised pick-all-labels reduction (PAL-N), with the same normalisation as above; and,
(e) the pick-one-label (POL) multiclass reduction, wherein a single multiclass example is created by randomly sampling a positive label [Joulin et al., 2017, Jernite et al., 2017].

Our main result is a formal justification of each reduction: we explicate the multilabel risks underpinning each of them (Proposition 5), and use this to show they are each consistent with respect to either precision or recall (Propositions 9, 10). Further, this dichotomy is inescapable: the Bayes-optimal scorers for the two measures are fundamentally incompatible (Corollaries 2, 4), except in trivial cases where all labels are conditionally independent, or the number of relevant labels is constant.

This finding has two important implications. First, while the above reductions appear superficially similar, they target fundamentally different performance measures. Consequently, when recall is of primary interest, such as when there are only a few relevant labels for an instance [Lapin et al., 2018], employing the "wrong" reduction can potentially lead to suboptimal performance. Second, the probability scores obtained from each reduction must be interpreted with caution: except for the one-versus-all reduction, they do not coincide with the marginal probabilities for each label. Naïvely using these probabilities for downstream decision-making may thus be sub-optimal.
In summary, our contributions are the following (see also Table 1):
(1) we formalise the implicit multilabel loss and risk underpinning five distinct multilabel learning reductions (§4.1) to a suitable binary or multiclass problem (Proposition 5).
(2) we establish suitable consistency of each reduction: the unnormalised reductions are consistent for precision, while the normalised reductions are consistent for recall (Propositions 9, 10).
(3) we empirically confirm that normalised reductions can yield recall gains over unnormalised counterparts, while the latter can yield precision gains over the former.

2 Background and notation

We formalise the multilabel classification problem, and its special case of multiclass classification.

2.1 Multilabel classification

Suppose we have an instance space X (e.g., queries) and label space Y ≐ {0,1}^L (e.g., documents) for some L ∈ N_+. Here, L represents the total number of possible labels. Given an instance x ∈ X with label vector y ∈ Y, we interpret y_i = 1 to mean that the label i is "relevant" to the instance x. Importantly, there may be multiple relevant labels for a given instance. Our goal is, informally, to find a ranking over labels given an instance (e.g., rank the most relevant documents for a query).

More precisely, let P be a distribution over X × Y, where P(y|x) denotes the suitability of label vector y for instance x, and P(0_L|x) = 0 (i.e., each instance must have at least one relevant label).¹ Our goal is to learn a scorer f : X → R^L that orders labels according to their suitability (e.g., scores documents based on their relevance for a given query). We evaluate a scorer according to the precision-at-k and recall-at-k for given k ∈ [L] ≐ {1, 2, ..., L} [Lapin et al., 2018]:²

Prec@k(f) ≐ E_{(x,y)}[ |rel(y) ∩ Top_k(f(x))| / k ]    Rec@k(f) ≐ E_{(x,y)}[ |rel(y) ∩ Top_k(f(x))| / |rel(y)| ],  (1)

where Top_k(f) returns the top k scoring labels according to f (assuming no ties), and rel(y) denotes the indices of the relevant (positive) labels of y. In a retrieval context, the recall@k may be favourable when k ≫ |rel(y)| (i.e., we retrieve a large number of documents, but there are only a few relevant documents for a query), since the precision will degrade as k increases [Lapin et al., 2018].

¹ Without this assumption, one may take the convention that 0/0 = 1 in defining the recall@k.
² Similar metrics may be defined when L = 1 [Kar et al., 2014, 2015, Liu et al., 2016, Tasche, 2018].

Optimising either of these measures directly is intractable, and so typically one picks a multilabel surrogate loss ℓ_ML : {0,1}^L × R^L → R_+, and minimises the multilabel risk R_ML(f) ≐ E_{(x,y)}[ℓ_ML(y, f(x))]. A Bayes-optimal scorer f* is any minimiser of R_ML. For several performance measures, f* is any monotone transformation of the marginal label probabilities P(y_i = 1|x) [Dembczyński et al., 2010, Koyejo et al., 2015, Wu and Zhou, 2017]. Thus, accurate estimation of these marginals suffices for good performance on these measures. This gives credence to the existence of efficient reductions that preserve multilabel classification performance, a topic we shall study in §4.

2.2 Multiclass classification

Multiclass classification is a special case of multilabel classification where each instance has only one relevant label. Concretely, our label space is now Z ≐ [L]. Suppose there is an unknown distribution P over X × Z, where P(z|x) denotes the suitability of label z for instance x. We may now evaluate a candidate scorer f : X → R^L according to the top-k risk for given k ∈ [L]:

R_top-k(f) ≐ P(z ∉ Top_k(f(x))).  (2)

This measure is natural when we can make k guesses as to an instance's label, and are only penalised if all guesses are incorrect [Lapin et al., 2018]. It is equivalent to the expected top-k loss, given by

ℓ_top-k(z, f) ≐ ⟦z ∉ Top_k(f)⟧.  (3)

When k = 1, we obtain the zero-one or misclassification loss. For computational tractability, rather than minimise (2) directly, one often optimises a surrogate loss ℓ_MC : [L] × R^L → R_+, with risk R_MC(f) ≐ E_{(x,z)}[ℓ_MC(z, f(x))]. Several surrogate losses have been studied [Zhang, 2004, Ávila Pires and Szepesvári, 2016], the most popular being the softmax cross-entropy ℓ_SM(i, f) ≐ −f_i + log Σ_{j∈[L]} e^{f_j}, with f_i being the ith coordinate of the vector f. Consistency of such surrogates with respect to the top-k error has been considered in several recent works (see e.g., [Lapin et al., 2015, 2018, Yang and Koyejo, 2019]). We say that ℓ_MC is consistent for the top-k error if driving the excess risk for ℓ_MC to zero also drives the excess risk for ℓ_top-k to zero; that is, for any sequence (f_n)_{n=1}^∞ of scorers,

reg(f_n; ℓ_MC) → 0 ⟹ reg(f_n; ℓ_top-k) → 0,  (4)

where the regret of a scorer f with respect to ℓ_MC is reg(f; ℓ_MC) ≐ R_MC(f) − inf_{g : X → R^L} R_MC(g).
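The metrics in (1) are straightforward to compute on a sample. A minimal numpy sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def topk(scores, k):
    """Indices of the k highest-scoring labels (assuming no ties)."""
    return np.argpartition(-scores, k - 1)[:k]

def prec_at_k(y, scores, k):
    """Empirical precision@k: fraction of the top-k labels that are relevant."""
    rel = set(np.flatnonzero(y))
    return len(rel & set(topk(scores, k))) / k

def rec_at_k(y, scores, k):
    """Empirical recall@k: fraction of the relevant labels that appear in the top-k."""
    rel = set(np.flatnonzero(y))
    return len(rel & set(topk(scores, k))) / len(rel)

y = np.array([1, 0, 1, 1, 0])            # rel(y) = {0, 2, 3}
f = np.array([0.9, 0.8, 0.7, 0.1, 0.0])  # top-2 labels are {0, 1}
print(prec_at_k(y, f, 2))  # 0.5: one of the two retrieved labels is relevant
print(rec_at_k(y, f, 2))   # 1/3: one of the three relevant labels is retrieved
```

Averaging these per-example values over draws from P recovers the population quantities in (1).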
3 Optimal scorers for multilabel precision and recall@k

Our focus in this paper is on multilabel classification performance according to the precision@k and recall@k (cf. (1)). It is therefore prudent to ask: what are the Bayes-optimal predictions for each measure? Answering this gives insight into what aspects of a scorer these measures focus on. Recently, Wydmuch et al. [2018] studied this question for precision@k, establishing that it is optimised by any order-preserving transformation of the marginal probabilities P(y_i = 1|x).

Lemma 1 ([Wydmuch et al., 2018]). The multilabel precision@k of a scorer f : X → R^L is

Prec@k(f) = E_x[ Σ_{i ∈ Top_k(f(x))} (1/k) · P(y_i = 1|x) ].

Corollary 2 ([Wydmuch et al., 2018]). Assuming there are no ties in the probabilities P(y_i = 1|x),

f* ∈ argmax_{f : X → R^L} Prec@k(f) ⟺ (∀x ∈ X) Top_k(f*(x)) = Top_k([P(y_i = 1|x)]_{i=1}^L).

We now show that, by contrast, the recall@k will in general not encourage ordering by the marginal probabilities. Indeed, it can be expressed in a nearly identical form to the precision@k, but with a crucial difference: the marginal probabilities are transformed by an additional nonlinear weighting.

Lemma 3. The multilabel recall@k of a scorer f : X → R^L is

Rec@k(f) = E_x[ Σ_{i ∈ Top_k(f(x))} P(y'_i = 1|x) ]
P(y'_i = 1|x) ≐ P(y_i = 1|x) · E_{y_¬i | x, y_i = 1}[ 1 / (1 + Σ_{j≠i} y_j) ],  (5)

where y_¬i denotes the vector of all but the ith label, i.e., (y_1, ..., y_{i−1}, y_{i+1}, ..., y_L) ∈ {0,1}^{L−1}.

The "transformed" probabilities P(y'_i = 1|x) will in general not preserve the ordering of the marginal probabilities P(y_i = 1|x), owing to the multiplication by a non-constant term. We thus have the following, which is implicit in Wydmuch et al. [2018, Proposition 1], wherein it was shown that the pick-one-label reduction is inconsistent with respect to precision.

Corollary 4. Assuming there are no ties in the probabilities P(y'_i = 1|x),

f* ∈ argmax_{f : X → R^L} Rec@k(f) ⟺ (∀x ∈ X) Top_k(f*(x)) = Top_k([P(y'_i = 1|x)]_{i=1}^L).

Further, the orders of P(y'_i = 1|x) and P(y_i = 1|x) do not coincide in general.

One implication of Corollary 4 is that, when designing reductions for multilabel learning, one must carefully assess which of these two measures (if any) the reduction is optimal for. We cannot hope for a reduction to be optimal for both, since their Bayes-optimal scorers are generally incompatible. We remark however that one special case where P(y'_i = 1|x) = P(y_i = 1|x) is when Σ_{i∈[L]} y_i is a constant for every instance, which may happen if the labels are conditionally independent.

Having obtained a handle on these performance measures, we proceed with our central object of inquiry: are they well approximated by existing multilabel reductions?
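To make the contrast between Corollaries 2 and 4 concrete, consider a small invented conditional distribution P(y|x) at a fixed x under which the marginal and transformed probabilities rank the labels differently. Note that (5) is equivalent to P(y'_i = 1|x) = E_y[y_i / Σ_j y_j], which is what the sketch below computes; all numbers are hypothetical:

```python
import numpy as np

# Hypothetical P(y|x) at a fixed x, over L = 4 labels: with probability 0.55
# labels {0, 1, 2} are jointly relevant; with probability 0.45 only label 3 is.
support = [np.array([1, 1, 1, 0]), np.array([0, 0, 0, 1])]
probs   = [0.55, 0.45]

marginal    = sum(p * y for p, y in zip(probs, support))            # P(y_i = 1|x)
transformed = sum(p * y / y.sum() for p, y in zip(probs, support))  # P(y'_i = 1|x), per (5)

print(marginal)     # [0.55 0.55 0.55 0.45]: the marginals favour labels 0-2
print(transformed)  # label 3 now dominates: 0.45 versus 0.55/3 for each of 0-2
print(int(np.argmax(marginal)), int(np.argmax(transformed)))  # 0 3
```

The top-1 label under the marginals differs from the top-1 label under the transformed probabilities, so no single ordering can be optimal for both Prec@1 and Rec@1 here.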
4 Reductions from multi- to single-label classification

We study five distinct reductions of multilabel to binary or multiclass classification. Our particular interest is in what multilabel performance measure (if any) these reductions implicitly aim to optimise. As a first step to answering this, we explicate the multilabel loss and risk underpinning each of them.

4.1 Multilabel reductions: loss functions

Recall from §2.1 that a standard approach to multilabel classification is minimising a suitable multilabel loss function ℓ_ML : {0,1}^L × R^L → R_+. To construct such a loss, a popular approach is to decompose it into a suitable combination of binary or multiclass losses; implicitly, this is a reduction of multilabel learning to a suitable binary or multiclass problem. We consider five distinct decompositions (see Table 2). For each, we explicate their underlying multilabel loss in order to compare them on an equal footing. Despite the widespread use of these reductions, to our knowledge, they have not been explicitly compared in this manner by prior work. Our goal is thus to provide a unified analysis of distinct methods, similar to the analysis in Dembczyński et al. [2012] of the label dependence assumptions underpinning multilabel algorithms.

One-versus-all (OVA). The first approach is arguably the simplest: we train L independent binary classification models to predict each y_i ∈ {0,1}. This can be interpreted as using the multilabel loss

ℓ_OVA(y, f) ≐ Σ_{i∈[L]} ℓ_BC(y_i, f_i) = Σ_{i∈[L]} { y_i · ℓ_BC(1, f_i) + (1 − y_i) · ℓ_BC(0, f_i) },  (6)

where ℓ_BC : {0,1} × R → R_+ is some binary classification loss (e.g., logistic loss). In words, we convert each (x, y) into a positive example for each label with y_i = 1, and a negative example for each label with y_i = 0. This is also known as the binary relevance model [Brinker et al., 2006, Tsoumakas and Vlahavas, 2007, Dembczyński et al., 2010].

Reduction                     Example instantiation
One-versus-all                Σ_{i∈[L]} { −y_i · log(e^{f_i}/(1 + e^{f_i})) − (1 − y_i) · log(1/(1 + e^{f_i})) }
Pick-all-labels               Σ_{i∈[L]} −y_i · log(e^{f_i}/Σ_{j∈[L]} e^{f_j})
One-versus-all normalised     Σ_{i∈[L]} { −(y_i/Σ_{j∈[L]} y_j) · log(e^{f_i}/(1 + e^{f_i})) − (1 − y_i/Σ_{j∈[L]} y_j) · log(1/(1 + e^{f_i})) }
Pick-all-labels normalised    Σ_{i∈[L]} −(y_i/Σ_{j∈[L]} y_j) · log(e^{f_i}/Σ_{j∈[L]} e^{f_j})
Pick-one-label                −y_{i'} · log(e^{f_{i'}}/Σ_{j∈[L]} e^{f_j}),  i' ∼ Discrete({y_i/Σ_{j∈[L]} y_j})

Table 2: Examples of multilabel losses underpinning various reductions, given labels y ∈ {0,1}^L and predictions f ∈ R^L. We assume a sigmoid or softmax cross-entropy for the relevant base losses ℓ_BC, ℓ_MC.

Pick-all-labels (PAL). Another natural approach involves a multiclass rather than binary loss: we convert each (x, y) for y ∈ Y into multiclass observations {(x, i) : i ∈ [L], y_i = 1}, with one observation per positive label [Reddi et al., 2019]. This can be interpreted as using the multilabel loss

ℓ_PAL(y, f) ≐ Σ_{i∈[L]} y_i · ℓ_MC(i, f),  (7)

where ℓ_MC : [L] × R^L → R_+ is some multiclass loss (e.g., softmax cross-entropy, per §2.2). While the base loss is multiclass (which, inherently, assumes there is only one relevant label), we compute the sum of many such multiclass losses, one for each positive label in y.³ Observe that each loss in the sum involves the entire vector of scores f; this is in contrast to the OVA loss, wherein each loss in the sum only depends on the scores for the ith label. Further, note that when L is large, one may design efficient stochastic approximations to such a loss [Reddi et al., 2019].

³ This is to be contrasted with the label powerset approach [Boutell et al., 2004], which treats each distinct label vector as a separate class, and thus creates a multiclass problem with 2^L classes.

One-versus-all normalised (OVA-N). A natural variant of the above two reductions is to normalise the contribution of each loss by the number of positive labels. For the OVA method, rather than independently model each label y_i, we thus model normalised labels:

ℓ_OVA-N(y, f) ≐ Σ_{i∈[L]} { (y_i/Σ_{j∈[L]} y_j) · ℓ_BC(1, f_i) + (1 − y_i/Σ_{j∈[L]} y_j) · ℓ_BC(0, f_i) }.  (8)

To gain some intuition for this loss, take the special case of square loss, ℓ_BC(y_i, f_i) = (y_i − f_i)². One may verify that ℓ_OVA-N(y, f) = Σ_{i∈[L]} (y'_i − f_i)² plus a constant, for y'_i ≐ y_i/Σ_j y_j. Thus, the loss encourages f_i to estimate the "normalised labels" y'_i, rather than the raw labels y_i as in OVA.

Pick-all-labels normalised (PAL-N). Similar to the OVA-N method, we can normalise PAL to:

ℓ_PAL-N(y, f) ≐ Σ_{i∈[L]} (y_i/Σ_{j∈[L]} y_j) · ℓ_MC(i, f) = (1/Σ_{j∈[L]} y_j) · ℓ_PAL(y, f).  (9)

Such a reduction appears to be folk-knowledge amongst practitioners (in particular, being allowed by popular libraries [Abadi et al., 2016]), but as far as we are aware, it has not been previously studied. To gain some intuition for this loss, observe that y'_i ≐ y_i/Σ_{j∈[L]} y_j forms a distribution over the labels. Suppose our scores f ∈ R^L are converted to a probability distribution via a suitable link function σ (e.g., the softmax), and we apply the log-loss as our ℓ_MC. Then, ℓ_PAL-N corresponds to minimising the cross-entropy between the true and model distributions over labels:

ℓ_PAL-N(y, f) = Σ_{i∈[L]} −y'_i · log σ(f_i) = KL(y' ∥ σ(f)) + Constant.

Pick-one-label (POL). In this reduction, given an example (x, y), we select a single random positive label from y as the true label for x [Jernite et al., 2017, Joulin et al., 2017]. This can be understood as a stochastic version of (9), which considers a weighted combination of all positive labels.
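The Table 2 instantiations translate directly into code. A sketch in numpy, with sigmoid/softmax cross-entropy as the base losses ℓ_BC, ℓ_MC (our own implementation for illustration, not the authors' code):

```python
import numpy as np

def log_sigmoid(v):
    # Numerically stable log(sigmoid(v)); log_sigmoid(-v) = log(1 - sigmoid(v)).
    return -np.logaddexp(0.0, -v)

def log_softmax(f):
    return f - np.logaddexp.reduce(f)

def ova(y, f):      # (6), with ℓ_BC the sigmoid cross-entropy
    return -np.sum(y * log_sigmoid(f) + (1 - y) * log_sigmoid(-f))

def pal(y, f):      # (7), with ℓ_MC the softmax cross-entropy
    return -np.sum(y * log_softmax(f))

def ova_n(y, f):    # (8): OVA on the normalised labels y / sum(y)
    yn = y / y.sum()
    return -np.sum(yn * log_sigmoid(f) + (1 - yn) * log_sigmoid(-f))

def pal_n(y, f):    # (9): PAL scaled by the number of positives
    return pal(y, f) / y.sum()

def pol(y, f, rng): # stochastic: softmax loss on one randomly drawn positive
    i = rng.choice(len(y), p=y / y.sum())
    return -log_softmax(f)[i]

y, f = np.array([1.0, 0.0, 1.0, 0.0]), np.array([2.0, 1.0, 0.0, -1.0])
print(np.isclose(pal_n(y, f), pal(y, f) / 2))  # True: y has two positive labels
```

Averaged over its random draw, pol(y, f, ·) equals pal_n(y, f), which is the loss-level counterpart of the risk identity R_PAL-N = R_POL stated in Proposition 5 below.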
4.2 Multilabel reductions: population risks

Each of the above reductions is intuitively plausible. For example, the OVA reduction is a classical approach, which has seen success in multiclass-to-binary contexts; it is natural to consider its use in multilabel contexts. On the other hand, the PAL reduction explicitly encourages "competition" amongst the various labels if, e.g., used with a softmax cross-entropy loss. Finally, the normalised reductions intuitively prevent instances with many relevant labels from dominating our modelling. In order to make such intuitions precise, a more careful analysis is needed. To do so, we consider what underlying multilabel risk (i.e., expected loss) each reduction implicitly optimises.

Proposition 5. Given a scorer f : X → R^L, the multilabel risks for each of the above reductions are:

R_OVA(f) = Σ_{i∈[L]} E_{(x,y_i)}[ℓ_BC(y_i, f_i(x))]
R_OVA-N(f) = Σ_{i∈[L]} E_{(x,y'_i)}[ℓ_BC(y'_i, f_i(x))]
R_PAL(f) = E_{(x,z)}[N(x) · ℓ_MC(z, f(x))]
R_PAL-N(f) = R_POL(f) = E_{(x,z')}[ℓ_MC(z', f(x))],

where P(y'_i = 1|x) is per (5), and we have defined discrete random variables z, z' over [L] by

P(z' = i|x) ≐ P(y'_i = 1|x)  (10)
P(z = i|x) ≐ N(x)^{−1} · P(y_i = 1|x)  (11)
N(x) ≐ Σ_{i∈[L]} P(y_i = 1|x).  (12)

Proposition 5 explicates that, as expected, the OVA and OVA-N methods decompose into sums of binary classification risks, while the other reductions decompose into sums of multiclass risks. There are three more interesting implications.

First, normalisation has a non-trivial effect: for both OVA and PAL, their normalised counterparts involve modified binary and multiclass label distributions respectively. In particular, while PAL involves P(z|x) constructed from the marginal label probabilities, PAL-N involves P(z'|x) constructed from the "transformed" probabilities in (5).

Second, PAL yields a weighted multiclass risk, where the weight N(x) is the expected number of relevant labels for x. Since PAL treats the multilabel problem as a series of multiclass problems for each positive label, instances with many relevant labels have a greater contribution to the loss. The weight can also be seen as normalising the marginal label probabilities [P(y_i = 1|x)]_{i∈[L]} to a valid multiclass distribution over the L labels. By contrast, the risk in PAL-N is unweighted, despite the losses for each being related by a scaling factor per (9). Intuitively, this is a consequence of the fact that the normaliser can vary across draws from P(y|x), i.e.,

E_{y|x}[ℓ_PAL-N(y, f(x))] = E_{y|x}[ (1/Σ_{j∈[L]} y_j) · ℓ_PAL(y, f(x)) ] ≠ E_{y|x}[ 1/Σ_{j∈[L]} y_j ] · E_{y|x}[ℓ_PAL(y, f(x))].

Third, there is a subtle distinction between PAL and OVA. The former allows for the use of an arbitrary multiclass loss ℓ_MC; in the simplest case, for a base binary classification loss ℓ_BC, we may choose a loss which treats the given label as a positive, and all other labels as negatives: ℓ_MC(i, f) = ℓ_BC(1, f_i) + Σ_{j≠i} ℓ_BC(0, f_j). This is a multiclass version of the one-versus-all reduction [Rifkin and Klautau, 2004]. However, even with this choice, the PAL and OVA risks do not agree.

Lemma 6. The risk of the PAL reduction using the multiclass one-versus-all loss is

R_PAL(f) = R_OVA(f) + Σ_{i∈[L]} E_x[(N(x) − 1) · ℓ_BC(0, f_i(x))].

To understand this intuitively, the PAL loss decomposes into one term for each relevant label; for each such term, we apply the loss ℓ_MC, which considers all other labels to be negative. Consequently, every irrelevant label is counted multiple times, which manifests in the extra weighting term above. We will shortly see how this influences the optimal scorers for the reduction.
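The distributions (10)-(12) and the identity of Lemma 6 can be checked numerically at a single x. The joint P(y|x) and scores below are invented for illustration; ℓ_BC is the logistic loss:

```python
import numpy as np

bce1 = lambda v: np.logaddexp(0.0, -v)  # ℓ_BC(1, v): logistic loss, positive label
bce0 = lambda v: np.logaddexp(0.0, v)   # ℓ_BC(0, v): logistic loss, negative label

# Invented joint P(y|x) at a fixed x over L = 3 labels, and arbitrary scores f.
support = [np.array([1, 1, 0]), np.array([0, 0, 1]), np.array([1, 0, 1])]
probs   = np.array([0.5, 0.3, 0.2])
f       = np.array([0.4, -0.2, 0.1])
L       = 3

marginal = sum(p * y for p, y in zip(probs, support))            # P(y_i = 1|x)
N        = marginal.sum()                                        # (12)
p_z      = marginal / N                                          # (11)
p_zprime = sum(p * y / y.sum() for p, y in zip(probs, support))  # (10)

# Lemma 6, checked at this x: with ℓ_MC the multiclass one-versus-all loss,
# E_y[ℓ_PAL] = E_y[ℓ_OVA] + (N(x) - 1) · Σ_i ℓ_BC(0, f_i).
mc_ova = lambda i: bce1(f[i]) + sum(bce0(f[j]) for j in range(L) if j != i)
r_pal = sum(p * sum(y[i] * mc_ova(i) for i in range(L)) for p, y in zip(probs, support))
r_ova = sum(p * sum(y[i] * bce1(f[i]) + (1 - y[i]) * bce0(f[i]) for i in range(L))
            for p, y in zip(probs, support))
print(np.isclose(r_pal, r_ova + (N - 1) * bce0(f).sum()))  # True
```

Both (10) and (11) sum to one, confirming that z and z' are valid multiclass label distributions.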
5 Optimal scorers for multilabel reductions

Having explicated the risks underpinning each of the reductions, we are now in a position to answer the question of what precisely they are optimising. We show that in fact each reduction is consistent for either the precision@k or recall@k; following §3, they cannot be optimal for both in general.

5.1 Multilabel reductions: Bayes-optimal predictions

We begin our analysis by computing the Bayes-optimal scorers for each reduction. Doing so requires we commit to a particular family of loss functions ℓ_BC and ℓ_MC. We consider the family of strictly proper losses [Savage, 1971, Buja et al., 2005, Reid and Williamson, 2010], whose Bayes-optimal solution in a binary or multiclass setting is the underlying class-probability. Canonical examples are the cross-entropy and square loss. In the following, we shall assume our scorers are of the form f : X → [0,1], so that they output valid probabilities rather than arbitrary real numbers; in practice this is typically achieved by coupling a scorer with a link function, e.g., the softmax or sigmoid.

Corollary 7. Suppose ℓ_BC and ℓ_MC are strictly proper losses, and we use scorers f : X → [0,1]. Then, for every x ∈ X and i ∈ [L], the Bayes-optimal f*_i(x) for the OVA and PAL reductions are

f*_OVA,i(x) = P(y_i = 1|x)    f*_PAL,i(x) = N(x)^{−1} · P(y_i = 1|x),

while the Bayes-optimal f*_i(x) for each of the "normalised" reductions (OVA-N, PAL-N, POL) is

f*_i(x) = P(y'_i = 1|x).

In words, the unnormalised reductions result in solutions that preserve the ordering of the marginal probabilities, while the normalised reductions result in solutions that preserve the ordering of the "transformed" marginal probabilities. Recalling Corollaries 2 and 4, we see that the unnormalised reductions implicitly optimise for precision, while the normalised reductions implicitly optimise for recall. While this fact is well known for OVA [Dembczyński et al., 2010, Wydmuch et al., 2018], to our knowledge, the question of Bayes-optimality has not been explored for the other reductions.

5.2 Multilabel reductions: consistency

Corollary 7 shows that the various reductions' asymptotic targets coincide with those for precision or recall. But what can be said about their consistency (in the sense of Equation 4)? For the multiclass reductions, we now show this follows owing to a stronger version of Corollary 7: PAL and PAL-N have identical risks (up to scaling and translation) to the precision and recall@k, respectively.

Corollary 8. Suppose ℓ_MC is the top-k loss in (3). Then, using this loss with PAL and PAL-N,

R_PAL(f) = Constant − k · Prec@k(f)    R_PAL-N(f) = Constant − Rec@k(f).

Despite the prior use of these reductions, the above connection has, to our knowledge, not been noted hitherto. It explicates that two superficially similar reductions optimise for fundamentally different quantities, and their usage should be motivated by which of these is useful for a particular application. Building on this, we now show that when used with surrogate losses that are consistent for the top-k error (cf. (4)), the reductions are consistent with respect to the precision and recall, respectively. In the following, denote the regret of a scorer f : X → R^L with respect to a multilabel loss ℓ_ML by reg(f; ℓ_ML) ≐ E_{(x,y)}[ℓ_ML(y, f(x))] − inf_{g : X → R^L} E_{(x,y)}[ℓ_ML(y, g(x))]. We similarly denote the regret for the precision@k and recall@k by reg(f; P@k) and reg(f; R@k).

Proposition 9. Suppose ℓ_MC is consistent for the top-k error. For any sequence (f_n)_{n=1}^∞ of scorers,

reg(f_n; ℓ_PAL) → 0 ⟹ reg(f_n; P@k) → 0
reg(f_n; ℓ_PAL-N) → 0 ⟹ reg(f_n; R@k) → 0.

Our final analysis is regarding the OVA-N method, which does not have a risk equivalence to the recall@k.⁴ Nonetheless, its Bayes-optimal scorers were seen to coincide with that of the recall; as a consequence, similar to the consistency analysis of OVA in Wydmuch et al. [2018], we may show that accurate estimation of the transformed probabilities P(y'_i = 1|x) implies good recall performance. This consequently implies consistency of OVA-N, as the latter can guarantee good probability estimates when equipped with a strongly proper loss [Agarwal, 2014], such as for example the cross-entropy or square loss.

Proposition 10. Suppose ℓ_BC is a λ-strongly proper loss. For any scorer f : X → [0,1],

reg(f; R@k) ≤ 2 · E_x[ max_{i∈[L]} |P(y'_i = 1|x) − f_i(x)| ] ≤ √(2/λ) · reg(f; ℓ_OVA-N).

⁴ Consistency of OVA for precision@k was shown in Wydmuch et al. [2018].

5.3 Implications and further considerations

We conclude our analysis with some implications of the above results.

OVA versus PAL. Corollary 7 suggests that for optimising precision (or recall), there is no asymptotic difference between using OVA and PAL (or their normalised counterparts). However, a potential advantage of the PAL approach is that it allows for use of tight surrogates to the top-k loss (3), wherein only the top-k scoring negatives are considered in the loss. In settings where L is large, one can efficiently optimise such a loss via the stochastic negative mining approach of Reddi et al. [2019].
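The precision/recall dichotomy established above can be computed exactly on a small invented conditional distribution: at a single x, ranking label i first gives Prec@1 equal to the marginal P(y_i = 1|x) and Rec@1 equal to the transformed P(y'_i = 1|x), so the two Bayes-optimal rankings can be compared directly. A sketch with hypothetical numbers:

```python
import numpy as np

# Invented P(y|x) at a single x: labels {0, 1} co-occur w.p. 0.6; label 2 alone w.p. 0.4.
support = [np.array([1, 1, 0]), np.array([0, 0, 1])]
probs   = [0.6, 0.4]

def prec1(i):  # exact Prec@1 when label i is ranked first: E_y[y_i]
    return sum(p * y[i] for p, y in zip(probs, support))

def rec1(i):   # exact Rec@1 when label i is ranked first: E_y[y_i / |rel(y)|]
    return sum(p * y[i] / y.sum() for p, y in zip(probs, support))

i_prec = int(np.argmax([prec1(i) for i in range(3)]))  # ranks by marginals: label 0
i_rec  = int(np.argmax([rec1(i) for i in range(3)]))   # ranks by transformed: label 2
print(prec1(i_prec), rec1(i_prec))  # 0.6 0.3: wins on precision, loses on recall
print(prec1(i_rec), rec1(i_rec))    # 0.4 0.4: wins on recall, loses on precision
```

No single choice of top label wins on both metrics here, matching the incompatibility of the two Bayes-optimal scorers.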
Interpreting model scores. One subtle implication of Corollary 7 is that for all methods but OVA, the learned scores do not reflect the marginal label probabilities. In particular, while the optimal scorers for OVA and PAL both preserve the order of the marginal probabilities, the latter additionally involves an instance-dependent scaling by N(x). Consider, then, an extreme scenario where for some x ∈ X, all labels have marginal P(y_i = 1|x) = 1. Under the OVA reduction, we would assign each label a score of 1, indicating "perfect relevance". However, under the PAL reduction, we would assign them a score of 1/L, which would naïvely indicate "low relevance". For PAL, one can rectify this by positing a form g_i(x) for P(y_i = 1|x), e.g., g_i(x) = σ(w_i^T x). Then, if one uses f_i(x) = g_i(x)/Σ_{j∈[L]} g_j(x), the learned g_i's will model the marginals, up to scaling.

Learning from multiclass samples. Our focus in the paper has been on settings where we have access to multilabel samples (x, y), and choose to convert them to suitable binary or multiclass samples, e.g., (x, z) for PAL. A natural question is whether it is possible to learn when we only have partial knowledge of the true multilabel vector y. For example, in an information retrieval setup, we would ideally like to observe the multilabel vector of all relevant documents for a query. However, in practice, we may only observe a single relevant document (e.g., the first document clicked by a user issuing a query), which is randomly sampled according to the marginal label distribution (11). In the notation of Proposition 5, such a setting corresponds to observing multiclass samples from P(x, z) directly, rather than multilabel samples from P(x, y). A natural thought is to then minimise the top-k risk (2) directly on such samples, in hopes of optimising for precision. Surprisingly, this does not correspond to optimising for precision or recall, as we now explicate.

Lemma 11. Pick any P(x, y) over X × {0,1}^L, inducing a distribution P(x, z) as in (11). Then,

R_top-k(f) = P(z ∉ Top_k(f(x))) = E_x[ Σ_{i ∉ Top_k(f(x))} N(x)^{−1} · P(y_i = 1|x) ].

This risk is similar to the precision@k, except for the N(x)^{−1} term. Proposition 5 reveals a crucial missing ingredient that explains this lack of equivalence: when learning from (x, z), one needs to weight the samples by N(x), the number of relevant labels. This is achieved implicitly by the PAL reduction, since the loss involves a term for each relevant label. One may further contrast the above to the POL reduction: while this reduction does not weight samples, it samples labels according to the transformed distribution (10), rather than the marginal distribution (11). This distinction is crucial in ensuring that POL matches the recall@k.

In sum, when we only observe a single relevant label for an instance, care is needed to understand what distribution this label is sampled from, and whether additional instance weighting is necessary.

6 Experimental validation

We now present empirical results validating the preceding theory. Specifically, we illustrate that using the pick-all-labels (PAL) reduction versus its normalised counterpart can have significant differences in terms of precision and recall performance. Since the reductions studied here are all extant in the literature, our aim is not to propose one of them as being "best" for multilabel classification; rather, we wish to verify that they simply optimise for different quantities.

To remove potential confounding factors, we construct a synthetic dataset where we have complete control over the data distribution. Our construction is inspired by an information retrieval scenario, wherein there are two distinct groups of queries (e.g., issued by different user bases), which have a largely disjoint set of relevant documents. Within a group, queries may be either generic or specific, i.e., have many or few matching documents.

Formally, we set X = R² and L = 10 labels. We draw instances from an equal-weighted mixture of two Gaussians, where the Gaussians are centered at (1, 1) and (−1, −1) respectively. For the first mixture component, we draw labels according to P(y|x) which is uniform over two possible y: either y = (1_{K−1}, 0, 0_K), or y = (0_{K−1}, 1, 0_K), where K ≐ L/2. For the second mixture component, the y's are supported on the "swapped" labels y = (0_K, 0, 1_{K−1}) and y = (0_K, 1, 0_{K−1}).

Figure 1: Precision@k and recall@k for the PAL and PAL-N reductions. As predicted by the theory, the former yields superior precision@k performance, while the latter yields superior recall@k performance. The OVA and OVA-N reductions show the same trend, and are omitted for clarity. The maximal number of labels for any instance is k = 5, which is the point at which both methods have overlapping curves.
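A minimal numpy sketch of this data generation, under our reading of the setup (unit-variance Gaussians is our assumption; the paper's exact code is not given):

```python
import numpy as np

def sample(n, L=10, seed=0):
    """Draw n (instance, label-vector) pairs from the synthetic distribution above."""
    rng = np.random.default_rng(seed)
    K = L // 2
    X = np.empty((n, 2)); Y = np.zeros((n, L))
    for i in range(n):
        group = rng.integers(2)               # which Gaussian / label group
        X[i] = rng.normal((1.0, 1.0) if group == 0 else (-1.0, -1.0))
        if rng.integers(2):                   # "generic": K - 1 relevant labels
            Y[i, :K - 1] = 1 if group == 0 else 0
            Y[i, K + 1:] = 0 if group == 0 else 1
        else:                                 # "specific": a single relevant label
            Y[i, K - 1 if group == 0 else K] = 1
    return X, Y

X, Y = sample(1000)
print(sorted(set(Y.sum(axis=1).tolist())))  # [1.0, 4.0]: each instance has 1 or K - 1 positives
```

Every instance has at least one relevant label, as the setup in §2.1 requires.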
With this setup, we generate a training sample of 10⁴ (instance, label) pairs. We train the OVA, PAL, OVA-N and PAL-N methods, and compute their precision and recall on a test sample of 10³ (instance, label) pairs. We use a linear model for our scorer f, and the softmax cross-entropy loss for ℓ_MC.

Figure 1 shows the precision and recall@k curves as k is varied. As predicted by the theory, there is a significant gap in performance between the two methods on each metric; e.g., the PAL method performs significantly better than its normalised counterpart in terms of precision. This illustrates the importance of choosing the correct reduction based on the ultimate performance measure of interest.

7 Conclusion and future work

We have studied five commonly used multilabel reductions in a unified framework, explicating the underlying multilabel loss each of them optimises. We then showed that each reduction is provably consistent with respect to either precision or recall, but not both. Further, we established that the Bayes-optimal scorers for the precision and recall only coincide in special cases (e.g., when the labels are conditionally independent), and so no reduction can be optimal for both. We empirically validated that normalised loss functions can yield recall gains over unnormalised counterparts. Consistency analysis for other multilabel metrics and generalisation analysis are natural directions for future work.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

Shivani Agarwal. Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal of Machine Learning Research, 15:1653–1674, 2014.

Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 13–24, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2035-1.

Bernardo Ávila Pires and Csaba Szepesvári. Multiclass classification calibration functions. arXiv e-prints, arXiv:1609.06385, Sep 2016.

Rohit Babbar and Bernhard Schölkopf. DiSMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM '17, pages 721–729, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4675-7.

Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems 28, pages 730–738. Curran Associates, Inc., 2015.

Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.

Klaus Brinker, Johannes Fürnkranz, and Eyke Hüllermeier. A unified model for multilabel classification and ranking. In Proceedings of the 2006 Conference on ECAI 2006, pages 489–493, Amsterdam, The Netherlands, 2006. IOS Press. ISBN 1-58603-642-4.

A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, UPenn, 2005.

Krzysztof Dembczyński, Weiwei Cheng, and Eyke Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, ICML '10, pages 279–286, USA, 2010. Omnipress. ISBN 978-1-60558-907-7.

Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5–45, July 2012. ISSN 0885-6125.

Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowle
dgeDiscoveryandDataMining,KDD\u201916,pages935\u2013944,NewYork,NY,USA,2016.ACM.ISBN978-1-4503-4232-2.HimanshuJain,VenkateshBalasubramanian,BhanuChunduri,andManikVarma.Slice:Scalablelinearextremeclassi\ufb01erstrainedon100millionlabelsforrelatedsearches.InProceedingsoftheTwelfthACMInternationalConferenceonWebSearchandDataMining,WSDM\u201919,pages528\u2013536,NewYork,NY,USA,2019.ACM.ISBN978-1-4503-5940-5.YacineJernite,AnnaChoromanska,andDavidSontag.Simultaneouslearningoftreesandrepresentationsforextremeclassi\ufb01cationanddensityestimation.InProceedingsofthe34thInternationalConferenceonMachineLearning,volume70ofProceedingsofMachineLearningResearch,pages1665\u20131674,InternationalConventionCentre,Sydney,Australia,06\u201311Aug2017.PMLR.ArmandJoulin,EdouardGrave,PiotrBojanowski,andTomasMikolov.Bagoftricksforef\ufb01cienttextclassi\ufb01ca-tion.InProceedingsofthe15thConferenceoftheEuropeanChapteroftheAssociationforComputationalLinguistics:Volume2,ShortPapers,pages427\u2013431,Valencia,Spain,April2017.AssociationforComputa-tionalLinguistics.PurushottamKar,HarikrishnaNarasimhan,andPrateekJain.Onlineandstochasticgradientmethodsfornon-decomposablelossfunctions.InProceedingsofthe27thInternationalConferenceonNeuralInformationProcessingSystems-Volume1,NIPS\u201914,pages694\u2013702,Cambridge,MA,USA,2014.MITPress.10\fPurushottamKar,HarikrishnaNarasimhan,andPrateekJain.Surrogatefunctionsformaximizingprecisionatthetop.InProceedingsofthe32ndInternationalConferenceonInternationalConferenceonMachineLearning,ICML\u201915,pages189\u2013198.JMLR.org,2015.OluwasanmiKoyejo,PradeepRavikumar,NagarajanNatarajan,andInderjitS.Dhillon.Consistentmultilabelclassi\ufb01cation.InProceedingsofthe28thInternationalConferenceonNeuralInformationProcessingSystems-Volume2,NIPS\u201915,pages3321\u20133329,Cambridge,MA,USA,2015.MITPress.M.Lapin,M.Hein,andB.Schiele.Analysisandoptimizationoflossfunctionsformulticlass,top-k,andmultilabelclassi\ufb01cation.IEEETransactionsonPatternAnalysisandMachineIntellige
nce,40(7):1533\u20131554,July2018.MaksimLapin,MatthiasHein,andBerntSchiele.Top-kmulticlasssvm.InProceedingsofthe28thInternationalConferenceonNeuralInformationProcessingSystems-Volume1,NIPS\u201915,pages325\u2013333,Cambridge,MA,USA,2015.MITPress.Li-PingLiu,ThomasG.Dietterich,NanLi,andZhi-HuaZhou.Transductiveoptimizationoftopkprecision.InProceedingsoftheTwenty-FifthInternationalJointConferenceonArti\ufb01cialIntelligence,IJCAI\u201916,pages1781\u20131787.AAAIPress,2016.ISBN978-1-57735-770-4.ChristopherD.Manning,PrabhakarRaghavan,andHinrichSch\u00fctze.IntroductiontoInformationRetrieval.CambridgeUniversityPress,NewYork,NY,USA,2008.YashotejaPrabhu,AnilKag,ShrutendraHarsola,RahulAgrawal,andManikVarma.Parabel:Partitionedlabeltreesforextremeclassi\ufb01cationwithapplicationtodynamicsearchadvertising.InProceedingsofthe2018WorldWideWebConference,WWW\u201918,pages993\u20131002,RepublicandCantonofGeneva,Switzerland,2018.InternationalWorldWideWebConferencesSteeringCommittee.ISBN978-1-4503-5639-8.HarishG.Ramaswamy,BalajiSrinivasanBabu,ShivaniAgarwal,andRobertC.Williamson.Ontheconsistencyofoutputcodebasedlearningalgorithmsformulticlasslearningproblems.InProceedingsofThe27thConferenceonLearningTheory,volume35ofProceedingsofMachineLearningResearch,pages885\u2013902,Barcelona,Spain,13\u201315Jun2014.PMLR.SashankJ.Reddi,SatyenKale,FelixYu,DanielHoltmann-Rice,JiecaoChen,andSanjivKumar.Stochasticnegativeminingforlearningwithlargeoutputspaces.InProceedingsofMachineLearningResearch,volume89ofProceedingsofMachineLearningResearch,pages1940\u20131949.PMLR,16\u201318Apr2019.MarkD.ReidandRobertC.Williamson.Compositebinarylosses.JournalofMachineLearningResearch,11:2387\u20132422,December2010.RyanRifkinandAldebaroKlautau.Indefenseofone-vs-allclassi\ufb01cation.J.Mach.Learn.Res.,5:101\u2013141,December2004.ISSN1532-4435.LeonardJ.Savage.Elicitationofpersonalprobabilitiesandexpectations.JournaloftheAmericanStatisticalAssociation,66(336):783\u2013801,1971.DirkTasche.Aplug-inapproachtomaximisingpre
cisionatthetopandrecallatthetop.CoRR,abs/1804.03077,2018.URLhttp://arxiv.org/abs/1804.03077.AmbujTewariandPeterL.Bartlett.Ontheconsistencyofmulticlassclassi\ufb01cationmethods.J.Mach.Learn.Res.,8:1007\u20131025,December2007.ISSN1532-4435.GrigoriosTsoumakasandIoannisKatakis.Multi-labelclassi\ufb01cation:Anoverview.IntJDataWarehousingandMining,2007:1\u201313,2007.GrigoriosTsoumakasandIoannisVlahavas.Randomk-labelsets:Anensemblemethodformultilabelclas-si\ufb01cation.InMachineLearning:ECML2007,pages406\u2013417,Berlin,Heidelberg,2007.SpringerBerlinHeidelberg.Xi-ZhuWuandZhi-HuaZhou.Auni\ufb01edviewofmulti-labelperformancemeasures.InProceedingsofthe34thInternationalConferenceonMachineLearning,volume70ofProceedingsofMachineLearningResearch,pages3780\u20133788,InternationalConventionCentre,Sydney,Australia,06\u201311Aug2017.PMLR.MarekWydmuch,KalinaJasinska,MikhailKuznetsov,R\u00f3bertBusa-Fekete,andKrzysztofDembczynski.Ano-regretgeneralizationofhierarchicalsoftmaxtoextrememulti-labelclassi\ufb01cation.InAdvancesinNeuralInformationProcessingSystems31,pages6355\u20136366.CurranAssociates,Inc.,2018.J.Xiao,J.Hays,K.A.Ehinger,A.Oliva,andA.Torralba.Sundatabase:Large-scalescenerecognitionfromabbeytozoo.In2010IEEEComputerSocietyConferenceonComputerVisionandPatternRecognition,pages3485\u20133492,June2010.11\fForestYangandSanmiKoyejo.Ontheconsistencyoftop-ksurrogatelosses.CoRR,abs/1901.11141,2019.URLhttp://arxiv.org/abs/1901.11141.IanE.H.Yen,XiangruHuang,WeiDai,PradeepRavikumar,InderjitDhillon,andEricXing.Ppdsparse:Aparallelprimal-dualsparsemethodforextremeclassi\ufb01cation.InProceedingsofthe23rdACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,KDD\u201917,pages545\u2013553,NewYork,NY,USA,2017.ACM.ISBN978-1-4503-4887-4.Hsiang-FuYu,PrateekJain,PurushottamKar,andInderjitS.Dhillon.Large-scalemulti-labellearningwithmissinglabels.InProceedingsofthe31stInternationalConferenceonInternationalConferenceonMachineLearning-Volume32,ICML\u201914,pagesI\u2013593\u2013I\u2013601.JM
LR.org,2014.M.ZhangandZ.Zhou.Areviewonmulti-labellearningalgorithms.IEEETransactionsonKnowledgeandDataEngineering,26(8):1819\u20131837,Aug2014.TongZhang.Statisticalbehaviorandconsistencyofclassi\ufb01cationmethodsbasedonconvexriskminimization.Ann.Statist.,32(1):56\u201385,022004.12\f", "award": [], "sourceid": 5615, "authors": [{"given_name": "Aditya", "family_name": "Menon", "institution": "Google"}, {"given_name": "Ankit Singh", "family_name": "Rawat", "institution": "Google"}, {"given_name": "Sashank", "family_name": "Reddi", "institution": "Google"}, {"given_name": "Sanjiv", "family_name": "Kumar", "institution": "Google"}]}
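To make the reductions above concrete: the PAL (pick-all-labels) reduction sums a softmax cross-entropy term over each relevant label of an instance, and its normalised variant PAL-N divides that instance's contribution by its number of relevant labels. The following is a minimal sketch under that reading; the function name `pal_loss` and all variable names are our own illustration, not code from the paper.

```python
import numpy as np

def pal_loss(scores, relevant, normalise=False):
    """Pick-all-labels reduction: sum of softmax cross-entropy terms,
    one per relevant label. With normalise=True this becomes the PAL-N
    variant, which divides by the number of relevant labels."""
    # Numerically stable log-softmax over all label scores.
    shifted = scores - np.max(scores)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    loss = -sum(log_probs[y] for y in relevant)
    return loss / len(relevant) if normalise else loss

# Toy example: 5 labels, labels {0, 3} relevant.
ex_scores = np.array([2.0, -1.0, 0.5, 1.5, 0.0])
print(pal_loss(ex_scores, {0, 3}))                  # unnormalised PAL
print(pal_loss(ex_scores, {0, 3}, normalise=True))  # PAL-N: half the above
```

By construction, PAL-N down-weights instances with many relevant labels, which is what drives the recall gains reported in the experiments.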
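The evaluation metrics in the experiments, precision@k and recall@k, admit an equally short sketch: take the k highest-scored labels, then measure what fraction of them are relevant (precision) or what fraction of the relevant labels they recover (recall). Again, the function and variable names below are our own, not the paper's.

```python
import numpy as np

def precision_at_k(scores, relevant, k):
    """Fraction of the top-k scored labels that are relevant."""
    topk = np.argsort(-scores)[:k]
    return len(set(topk) & relevant) / k

def recall_at_k(scores, relevant, k):
    """Fraction of the relevant labels recovered in the top-k."""
    topk = np.argsort(-scores)[:k]
    return len(set(topk) & relevant) / len(relevant)

# Toy example: 5 labels, labels {0, 3} relevant; top-2 by score is {0, 3}.
scores = np.array([2.0, -1.0, 0.5, 1.5, 0.0])
relevant = {0, 3}
print(precision_at_k(scores, relevant, 2))  # -> 1.0
print(recall_at_k(scores, relevant, 2))     # -> 1.0
```

Note the only difference is the denominator (k versus the number of relevant labels), which mirrors the paper's finding that no single reduction can be optimal for both measures in general.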