{"title": "On Multilabel Classification and Ranking with Partial Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 1151, "page_last": 1159, "abstract": "We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show $O(T^{1/2}\\log T)$ regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance.", "full_text": "On Multilabel Classi\ufb01cation and Ranking with\n\nPartial Feedback\n\nClaudio Gentile\n\nDiSTA, Universit`a dell\u2019Insubria, Italy\n\nclaudio.gentile@uninsubria.it\n\nFrancesco Orabona\nTTI Chicago, USA\n\nfrancesco@orabona.com\n\nAbstract\n\nWe present a novel multilabel/ranking algorithm working in partial information\nsettings. The algorithm is based on 2nd-order descent methods, and relies on\nupper-con\ufb01dence bounds to trade-off exploration and exploitation. We analyze\nthis algorithm in a partial adversarial setting, where covariates can be adversarial,\nbut multilabel probabilities are ruled by (generalized) linear models. We show\nO(T 1/2 log T ) regret bounds, which improve in several ways on the existing re-\nsults. We test the effectiveness of our upper-con\ufb01dence scheme by contrasting\nagainst full-information baselines on real-world multilabel datasets, often obtain-\ning comparable performance.\n\nIntroduction\n\n1\nConsider a book recommendation system. 
Given a customer\u2019s profile, the system recommends a few possible books to the user by means of, e.g., a limited number of banners placed at different positions on a webpage. The system\u2019s goal is to select books that the user likes and possibly purchases. Typical feedback in such systems is the actual action of the user, in particular, which books they have bought/preferred, if any. The system cannot observe what the user\u2019s actions would have been had other books been recommended, or had the same book ads been placed in a different order within the webpage. Such problems are collectively referred to as learning with partial feedback. As opposed to the full information case, where the system (the learning algorithm) knows the outcome of each possible response (e.g., the user\u2019s action for each and every possible book recommendation placed in the largest banner ad), in the partial feedback setting the system only observes the response to very limited options and, specifically, the option that was actually recommended. In this and many other examples of this sort, it is reasonable to assume that recommended options are not given the same treatment by the system, e.g., large banners displayed on top of the page should somehow be more committing as a recommendation than smaller ones placed elsewhere. Moreover, it is often plausible to interpret the user feedback as a preference (if any) restricted to the displayed alternatives.\n\nWe consider instantiations of this problem in the multilabel and learning-to-rank settings. Learning proceeds in rounds: at each time step $t$ the algorithm receives an instance $x_t$ and outputs an ordered subset $\\hat{Y}_t$ of labels from a finite set of possible labels $[K] = \\{1, 2, \\ldots, K\\}$. Restrictions might apply to the size of $\\hat{Y}_t$ (due, e.g., to the number of available slots in the webpage).
The set $\\hat{Y}_t$ corresponds to the aforementioned recommendations, and is intended to approximate the true set of preferences associated with $x_t$. However, the latter set is never observed. In its stead, the algorithm receives $Y_t \\cap \\hat{Y}_t$, where $Y_t \\subseteq [K]$ is a noisy version of the true set of user preferences on $x_t$. When we are restricted to $|\\hat{Y}_t| = 1$ for all $t$, this becomes a multiclass classification problem with bandit feedback \u2013 see below.\n\nRelated work. This paper lies at the intersection between online learning with partial feedback and multilabel classification/ranking. Both fields include a substantial amount of work, so we can hardly do them justice here. We outline some of the main contributions in the two fields, with an emphasis on those we believe are the most related to this paper.\n\nA well-known and standard tool for facing the problem of partial feedback in online learning is to trade off exploration and exploitation through upper confidence bounds [16]. In the so-called bandit setting with contextual information (sometimes called bandits with side information or bandits with covariates, e.g., [3, 4, 5, 7, 15], and references therein) an online algorithm receives at each time step a context (typically, in the form of a feature vector $x$) and is compelled to select an action (e.g., a label), whose goodness is quantified by a predefined loss function. Full information about the loss function is not available. The specifics of the interaction model determine which pieces of loss information will be observed by the algorithm, e.g., the actual value of the loss on the chosen action, some information on more profitable directions on the action space, noisy versions thereof, etc. The overall goal is to compete against classes of functions that map contexts to (expected) losses in a regret sense, that is, to obtain sublinear cumulative regret bounds.
For instance, [1, 3, 5, 7] work in a finite action space where the context-to-loss mappings for each action are linear (or generalized linear, as in [7]) functions of the features. They all obtain $T^{1/2}$-like regret bounds, where $T$ is the time horizon. This is extended in [15], where the loss function is modeled as a sample from a Gaussian process over the joint context-action space. We are using a similar (generalized) linear modeling here. Linear multiclass classification problems with bandit feedback are considered in, e.g., [4, 11, 13], where either $T^{2/3}$ or $T^{1/2}$ or even logarithmic regret bounds are proven, depending on the noise model and the underlying loss functions.\n\nNone of the above papers considers structured action spaces, where the learner is allowed to select sets of actions, a setting which is more suitable to multilabel and ranking problems. Along these lines are the papers [10, 14, 19, 20, 22]. The general problem of online minimization of a submodular loss function under both full and bandit information without covariates is considered in [10], achieving a $T^{2/3}$ regret in the bandit case. In [22] the problem of online learning of assignments is considered, where an algorithm is requested to assign positions (e.g., rankings) to sets of items (e.g., ads) with given constraints on the set of items that can be placed in each position. Their problem shares motivations similar to ours but, again, the bandit version of their algorithm does not explicitly take side information into account, and leads to a $T^{2/3}$ regret bound. In [14] the aim is to learn a suitable ordering of the available actions. Among other things, the authors prove a $T^{1/2}$ regret bound in the bandit setting with a multiplicative weight updating scheme. Yet, no contextual information is incorporated.
In [20] the ability to select sets of actions is motivated by a problem of diverse retrieval in large document collections, which are meant to live in a general metric space. The generality of this approach does not lead to strong regret guarantees for specific (e.g., smooth) loss functions. [19] uses a simple linear model for the hidden utility function of users interacting with a web system and providing partial feedback in any form that allows the system to make significant progress in learning this function. A $T^{1/2}$ regret bound is again provided, depending on the degree of informativeness of the feedback. It is experimentally argued that this kind of feedback is typically made available by a user that clicks on relevant URLs out of a list presented by a search engine. Despite the neatness of the argument, no formal effort is put into relating this information to the context information at hand or to the way data are generated. Finally, the recent paper [2] investigates classes of graphical models for contextual bandit settings that afford richer interaction between contexts and actions, leading again to a $T^{2/3}$ regret bound.\n\nThe literature on multilabel learning and learning to rank is overwhelming. The wide attention this literature attracts is often motivated by its web-search-engine or recommender-system applications, and many of the papers are experimental in nature. Relevant references include [6, 9, 23], along with the references therein. Moreover, when dealing with multilabel problems, the typical assumption is full supervision, an important concern being the modeling of correlations among classes. In contrast, the specific setting we are considering here need not face such a modeling issue. Other related references are [8, 12], where learning proceeds by pairs of examples. Yet, these approaches need i.i.d. assumptions on the data, and typically deliver batch learning procedures.
To summarize, whereas we are technically close to [1, 3, 4, 5, 7, 15], from a motivational standpoint we are perhaps closest to [14, 19, 22].\n\nOur results. We investigate the multilabel and learning-to-rank problems in a partial feedback scenario with contextual information, where we assume a probabilistic linear model over the labels, although the contexts can be chosen by an adaptive adversary. We consider two families of loss functions: one is a cost-sensitive multilabel loss that generalizes the standard Hamming loss in several respects, the other is a kind of (unnormalized) ranking loss. In both cases, the learning algorithm maintains a (generalized) linear predictor for the probability that a given label occurs, the ranking being produced by upper-confidence-corrected estimated probabilities. In such settings, we prove $T^{1/2} \\log T$ cumulative regret bounds \u2014 these bounds are optimal, up to log factors, when the label probabilities are fully linear in the contexts. A distinguishing feature of our user feedback model is that, unlike previous papers (e.g., [1, 10, 15, 22]), we are not assuming the algorithm observes a noisy version of the risk function on the currently selected action. In fact, when a generalized linear model is adopted, the context-to-risk mapping turns out to be nonconvex in the parameter space. Furthermore, when operating on structured action spaces, this more traditional form of bandit model does not seem appropriate to capture typical user preference feedback. Our approach is based on having the loss decouple from the label generating model, the user feedback being a noisy version of the gradient of a surrogate convex loss associated with the model itself. As a consequence, the algorithm is not directly dealing with the original loss when exploring. Though the emphasis is on theoretical results, we also validate our algorithms on two real-world multilabel datasets w.r.t.
a number of loss functions, showing good comparative performance against simple multilabel/ranking baselines that operate with full information.\n\n2 Model and preliminaries\n\nWe consider a setting where the algorithm receives at time $t$ the side information vector $x_t \\in \\mathbb{R}^d$, is allowed to output a (possibly ordered) subset $\\hat{Y}_t \\subseteq [K]$ of the set of possible labels, then the subset of labels $Y_t \\subseteq [K]$ associated with $x_t$ is generated, and the algorithm gets as feedback $\\hat{Y}_t \\cap Y_t$. The loss suffered by the algorithm may take into account several things: the distance between $Y_t$ and $\\hat{Y}_t$ (both viewed as sets), as well as the cost for playing $\\hat{Y}_t$. The cost $c(\\hat{Y}_t)$ associated with $\\hat{Y}_t$ might be given by the sum of costs suffered on each class $i \\in \\hat{Y}_t$, where we possibly take into account the order in which $i$ occurs within $\\hat{Y}_t$ (viewed as an ordered list of labels). Specifically, given a constant $a \\in [0, 1]$ and costs $c = \\{c(i, s),\\ i = 1, \\ldots, s,\\ s \\in [K]\\}$, such that $1 \\ge c(1, s) \\ge c(2, s) \\ge \\ldots \\ge c(s, s) \\ge 0$ for all $s \\in [K]$, we consider the loss function\n\n$\\ell_{a,c}(Y_t, \\hat{Y}_t) = a\\,|Y_t \\setminus \\hat{Y}_t| + (1 - a) \\sum_{i \\in \\hat{Y}_t \\setminus Y_t} c(j_i, |\\hat{Y}_t|),$\n\nwhere $j_i$ is the position of class $i$ in $\\hat{Y}_t$, and $c(j_i, \\cdot)$ depends on $\\hat{Y}_t$ only through its size $|\\hat{Y}_t|$. In the above, the first term accounts for the false negative mistakes, hence there is no specific ordering of labels therein. The second term collects the loss contribution provided by all false positive classes, taking into account through the costs $c(j_i, |\\hat{Y}_t|)$ the order in which labels occur in $\\hat{Y}_t$. The constant $a$ weighs the relative importance of false positive vs. false negative mistakes. As a specific example, suppose that $K = 10$, the costs $c(i, s)$ are given by $c(i, s) = (s - i + 1)/s$, $i = 1, \\ldots, s$, the algorithm plays $\\hat{Y}_t = (4, 3, 6)$, but $Y_t$ is $\\{1, 3, 8\\}$. In this case, $|Y_t \\setminus \\hat{Y}_t| = 2$, and $\\sum_{i \\in \\hat{Y}_t \\setminus Y_t} c(j_i, |\\hat{Y}_t|) = 3/3 + 1/3$, i.e., the cost for mistakenly playing class 4 in the top slot of $\\hat{Y}_t$ is more damaging than mistakenly playing class 6 in the third slot. In the special case when all costs are unitary, there is no longer need to view $\\hat{Y}_t$ as an ordered collection, and the above loss reduces to a standard Hamming-like loss between the sets $Y_t$ and $\\hat{Y}_t$, i.e., $a\\,|Y_t \\setminus \\hat{Y}_t| + (1 - a)\\,|\\hat{Y}_t \\setminus Y_t|$. Notice that the partial feedback $\\hat{Y}_t \\cap Y_t$ allows the algorithm to know which of the chosen classes in $\\hat{Y}_t$ are good or bad (and to what extent, because of the selected ordering within $\\hat{Y}_t$). Yet, the algorithm does not observe the value of $\\ell_{a,c}(Y_t, \\hat{Y}_t)$ because $Y_t \\setminus \\hat{Y}_t$ remains hidden.\n\nWorking with the above loss function makes the algorithm\u2019s output $\\hat{Y}_t$ become a ranked list of classes, where ranking is restricted to the deemed relevant classes only. In our setting, only a relevance feedback among the selected classes is observed (the set $Y_t \\cap \\hat{Y}_t$), but no supervised ranking information (e.g., in the form of pairwise preferences) is provided to the algorithm within this set. Alternatively, we can think of a ranking framework where restrictions on the size of $\\hat{Y}_t$ are set by an exogenous (and possibly time-varying) parameter of the problem, and the algorithm is required to provide a ranking complying with these restrictions. More on the connection to the ranking setting with partial feedback is in Section 4.\n\nThe problem arises as to which noise model we should adopt so as to encompass significant real-world settings while at the same time affording efficient implementation of the resulting algorithms. For any subset $Y_t \\subseteq [K]$, we let $(y_{1,t}, \\ldots, y_{K,t}) \\in \\{0, 1\\}^K$ be the corresponding indicator vector. Then it is easy to see that $\\ell_{a,c}(Y_t, \\hat{Y}_t) = a \\sum_{i=1}^K y_{i,t} + (1 - a) \\sum_{i \\in \\hat{Y}_t} \\left( c(j_i, |\\hat{Y}_t|) - \\left( \\frac{a}{1-a} + c(j_i, |\\hat{Y}_t|) \\right) y_{i,t} \\right)$. Moreover, because the first sum does not depend on $\\hat{Y}_t$, for the sake of optimizing over $\\hat{Y}_t$ we can equivalently define\n\n$\\ell_{a,c}(Y_t, \\hat{Y}_t) = (1 - a) \\sum_{i \\in \\hat{Y}_t} \\left( c(j_i, |\\hat{Y}_t|) - \\left( \\frac{a}{1-a} + c(j_i, |\\hat{Y}_t|) \\right) y_{i,t} \\right). \\quad (1)$\n\nLet $P_t(\\cdot)$ be a shorthand for the conditional probability $P(\\cdot\\,|\\,x_t)$, where the side information vector $x_t$ can in principle be generated by an adaptive adversary as a function of the past. Then $P_t(y_{1,t}, \\ldots, y_{K,t}) = P(y_{1,t}, \\ldots, y_{K,t}\\,|\\,x_t)$, where the marginals $P_t(y_{i,t} = 1)$ satisfy\n\n$P_t(y_{i,t} = 1) = \\frac{g(-u_i^\\top x_t)}{g(u_i^\\top x_t) + g(-u_i^\\top x_t)}, \\quad i = 1, \\ldots, K, \\quad (2)$\n\nfor some $K$ vectors $u_1, \\ldots, u_K \\in \\mathbb{R}^d$ and some (known) function $g : D \\subseteq \\mathbb{R} \\to \\mathbb{R}^+$. The model is well defined if $u_i^\\top x \\in D$ for all $i$ and all $x \\in \\mathbb{R}^d$ chosen by the adversary. We assume for the sake of simplicity that $||x_t|| = 1$ for all $t$. Notice that at this point the variables $y_{i,t}$ need not be conditionally independent: we are only defining a family of allowed joint distributions $P_t(y_{1,t}, \\ldots, y_{K,t})$ through the properties of their marginals $P_t(y_{i,t})$. (The reader familiar with generalized linear models will recognize the derivative of the function $p(\\Delta) = \\frac{g(-\\Delta)}{g(\\Delta) + g(-\\Delta)}$ as the (inverse) link function of the associated canonical exponential family of distributions [17].)\n\nThe function $g$ above will be instantiated to the negative derivative of a suitable convex and nonincreasing loss function $L$ which our algorithm will be based upon.
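To make the loss concrete, the worked example above can be replayed in a few lines of code (an illustrative Python sketch, not part of the paper; label positions $j_i$ are 1-based, the cost function is passed as a callable, and the choice $a = 0.5$ in the demo is ours):

```python
def loss_ac(Y, Y_hat, a, c):
    """Cost-sensitive multilabel loss l_{a,c}(Y, Y_hat).

    Y      -- true label set
    Y_hat  -- predicted labels, as an ordered tuple (position j_i in Y_hat)
    a      -- trade-off between false negatives and false positives
    c      -- cost function c(position, size), nonincreasing in position"""
    s = len(Y_hat)
    false_neg = len(set(Y) - set(Y_hat))
    false_pos_cost = sum(
        c(pos + 1, s) for pos, i in enumerate(Y_hat) if i not in Y
    )
    return a * false_neg + (1 - a) * false_pos_cost

# The paper's example: c(i, s) = (s - i + 1)/s, Y_hat = (4, 3, 6), Y = {1, 3, 8}.
# |Y \ Y_hat| = 2 and the false-positive term is c(1, 3) + c(3, 3) = 3/3 + 1/3.
c = lambda i, s: (s - i + 1) / s
print(loss_ac({1, 3, 8}, (4, 3, 6), 0.5, c))  # 0.5*2 + 0.5*(4/3)
```

Note that only the false-positive term is observable from the feedback $Y_t \\cap \\hat{Y}_t$; the false-negative term stays hidden, as remarked above.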
For instance, if $L$ is the square loss $L(\\Delta) = (1 - \\Delta)^2/2$, then $g(\\Delta) = 1 - \\Delta$, resulting in $P_t(y_{i,t} = 1) = (1 + u_i^\\top x_t)/2$, under the assumption $D = [-1, 1]$. If $L$ is the logistic loss $L(\\Delta) = \\ln(1 + e^{-\\Delta})$, then $g(\\Delta) = (e^{\\Delta} + 1)^{-1}$, and $P_t(y_{i,t} = 1) = e^{u_i^\\top x_t}/(e^{u_i^\\top x_t} + 1)$, with domain $D = \\mathbb{R}$.\n\nSet for brevity $\\Delta_{i,t} = u_i^\\top x_t$. Taking into account (1), this model allows us to write the (conditional) expected loss of the algorithm playing $\\hat{Y}_t$ as\n\n$E_t[\\ell_{a,c}(Y_t, \\hat{Y}_t)] = (1 - a) \\sum_{i \\in \\hat{Y}_t} \\left( c(j_i, |\\hat{Y}_t|) - \\left( \\frac{a}{1-a} + c(j_i, |\\hat{Y}_t|) \\right) p_{i,t} \\right), \\quad (3)$\n\nwhere $p_{i,t} = \\frac{g(-\\Delta_{i,t})}{g(\\Delta_{i,t}) + g(-\\Delta_{i,t})}$, and the expectation $E_t$ above is w.r.t. the generation of labels $Y_t$, conditioned on both $x_t$ and all previous $x$ and $Y$. A key aspect of this formalization is that the Bayes optimal ordered subset $Y^*_t = \\mathrm{argmin}_{Y = (j_1, j_2, \\ldots, j_{|Y|}) \\subseteq [K]}\\, E_t[\\ell_{a,c}(Y_t, Y)]$ can be computed efficiently when knowing $\\Delta_{1,t}, \\ldots, \\Delta_{K,t}$. This is handled by the next lemma. In words, this lemma says that, in order to minimize (3), it suffices to try out all possible sizes $s = 0, 1, \\ldots, K$ for $Y^*_t$ and, for each such value, determine the sequence $Y^*_{s,t}$ that minimizes (3) over all sequences of size $s$. In turn, $Y^*_{s,t}$ can be computed just by sorting the classes $i \\in [K]$ in decreasing order of $p_{i,t}$, the sequence $Y^*_{s,t}$ being given by the first $s$ classes in this sorted list. (Due to space limitations, all proofs are given in the supplementary material.)\n\nLemma 1. With the notation introduced so far, let $p_{i_1,t} \\ge p_{i_2,t} \\ge \\ldots \\ge p_{i_K,t}$ be the sequence of the $p_{i,t}$ sorted in nonincreasing order. Then we have that $Y^*_t = \\mathrm{argmin}_{s = 0, 1, \\ldots, K}\\, E_t[\\ell_{a,c}(Y_t, Y^*_{s,t})]$, where $Y^*_{s,t} = (i_1, i_2, \\ldots, i_s)$, and $Y^*_{0,t} = \\emptyset$.\n\nNotice the way the costs $c(i, s)$ influence the Bayes optimal computation.
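In code, the recipe of Lemma 1 can be sketched as follows (a minimal Python illustration with 0-based class indices; `expected_loss` implements (3), with `p[i]` standing for $p_{i,t}$ and the cost function passed as a callable):

```python
def expected_loss(Y, p, a, c):
    """Conditional expected loss (3) of playing the ordered subset Y,
    given the marginals p[i] = P_t(y_i = 1)."""
    s = len(Y)
    return (1 - a) * sum(
        c(pos + 1, s) - (a / (1 - a) + c(pos + 1, s)) * p[i]
        for pos, i in enumerate(Y)
    )

def bayes_optimal(p, a, c):
    """Lemma 1: sort classes by decreasing p_i, then try every prefix
    size s = 0..K and keep the prefix minimizing the expected loss (3)."""
    order = sorted(range(len(p)), key=lambda i: -p[i])
    best, best_loss = (), 0.0  # s = 0: the empty prediction has zero loss
    for s in range(1, len(p) + 1):
        cand = tuple(order[:s])
        loss = expected_loss(cand, p, a, c)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best
```

For instance, with constant costs $c(i, s) = 1$ and $a = 1/2$, `bayes_optimal([0.9, 0.2, 0.6, 0.4], 0.5, lambda i, s: 1.0)` keeps exactly the classes with marginal above $1/2$, in decreasing order, consistent with the constant-threshold criterion discussed in the text.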
We see from (3) that placing class $i$ within $\\hat{Y}_t$ in position $j_i$ is beneficial (i.e., it leads to a reduction of loss) if and only if $p_{i,t} > c(j_i, |\\hat{Y}_t|) / \\left( \\frac{a}{1-a} + c(j_i, |\\hat{Y}_t|) \\right)$. Hence, the higher the slot $j_i$ is in $\\hat{Y}_t$, the larger $p_{i,t}$ should be in order for this inclusion to be convenient. (Notice that this criterion depends on the actual size of $\\hat{Y}_t$, so we cannot decompose this problem into $K$ independent problems. The decomposition does occur if the costs $c(i, s)$ are constants, independent of $i$ and $s$, in which case the criterion for inclusion becomes $p_{i,t} \\ge \\theta$, for some constant threshold $\\theta$.) It is $Y^*_t$ that we interpret as the true set of user preferences on $x_t$.\n\nWe would like to compete against the above $Y^*_t$ in a cumulative regret sense, i.e., we would like to bound $R_T = \\sum_{t=1}^T \\left( E_t[\\ell_{a,c}(Y_t, \\hat{Y}_t)] - E_t[\\ell_{a,c}(Y_t, Y^*_t)] \\right)$ with high probability. We use a similar but largely more general analysis than [4]\u2019s to devise an online second-order descent algorithm whose updating rule makes the comparison vector $U = (u_1, \\ldots, u_K) \\in \\mathbb{R}^{dK}$ defined through (2) be Bayes optimal w.r.t. a surrogate convex loss $L(\\cdot)$ such that $g(\\Delta) = -L'(\\Delta)$. Observe that the expected loss function (3) is, generally speaking, nonconvex in the margins $\\Delta_{i,t}$ (consider, for instance, the logistic case $g(\\Delta) = \\frac{1}{e^{\\Delta}+1}$). Thus, we cannot directly minimize this expected loss.\n\nParameters: loss parameters $a \\in [0, 1]$, cost values $c(i, s)$, interval $D = [-R, R]$, function $g : D \\to \\mathbb{R}^+$, confidence level $\\delta \\in [0, 1]$.\nInitialization: $A_{i,0} = I \\in \\mathbb{R}^{d \\times d}$, $w_{i,1} = 0 \\in \\mathbb{R}^d$, $i = 1, \\ldots, K$.\nFor $t = 1, 2, \\ldots, T$:\n1. Get instance $x_t \\in \\mathbb{R}^d$ with $||x_t|| = 1$;\n2. For $i \\in [K]$, set $\\hat{\\Delta}'_{i,t} = x_t^\\top w'_{i,t}$, where $w'_{i,t} = w_{i,t}$ if $w_{i,t}^\\top x_t \\in [-R, R]$, and $w'_{i,t} = w_{i,t} - \\frac{w_{i,t}^\\top x_t - R\\,\\mathrm{sign}(w_{i,t}^\\top x_t)}{x_t^\\top A_{i,t-1}^{-1} x_t}\\, A_{i,t-1}^{-1} x_t$ otherwise; set also $\\epsilon^2_{i,t} = x_t^\\top A_{i,t-1}^{-1} x_t \\left( U^2 + \\frac{d\\, c'_L}{(c''_L)^2} \\ln\\left(1 + \\frac{t-1}{d}\\right) + \\frac{12}{c''_L} \\left( \\frac{c'_L}{c''_L} + 3 L(-R) \\right) \\ln\\frac{K(t+4)}{\\delta} \\right)$;\n3. Output $\\hat{Y}_t = \\mathrm{argmin}_{Y = (j_1, j_2, \\ldots, j_{|Y|}) \\subseteq [K]} \\sum_{i \\in Y} \\left( c(j_i, |Y|) - \\left( \\frac{a}{1-a} + c(j_i, |Y|) \\right) \\hat{p}_{i,t} \\right)$, where $\\hat{p}_{i,t} = \\frac{g(-[\\hat{\\Delta}'_{i,t} + \\epsilon_{i,t}]_D)}{g([\\hat{\\Delta}'_{i,t} + \\epsilon_{i,t}]_D) + g(-[\\hat{\\Delta}'_{i,t} + \\epsilon_{i,t}]_D)}$;\n4. Get feedback $Y_t \\cap \\hat{Y}_t$;\n5. For $i \\in [K]$, update $A_{i,t} = A_{i,t-1} + |s_{i,t}|\\, x_t x_t^\\top$ and $w_{i,t+1} = w'_{i,t} - \\frac{1}{c''_L} A_{i,t}^{-1} \\nabla_{i,t}$, where $s_{i,t} = 1$ if $i \\in Y_t \\cap \\hat{Y}_t$, $s_{i,t} = -1$ if $i \\in \\hat{Y}_t \\setminus Y_t = \\hat{Y}_t \\setminus (Y_t \\cap \\hat{Y}_t)$, $s_{i,t} = 0$ otherwise, and $\\nabla_{i,t} = \\nabla_w L(s_{i,t}\\, w^\\top x_t)\\big|_{w = w'_{i,t}} = -g(s_{i,t} \\hat{\\Delta}'_{i,t})\\, s_{i,t}\\, x_t$.\n\nFigure 1: The partial feedback algorithm in the (ordered) multiple label setting.\n\n3 Algorithm and regret bounds\n\nIn Figure 1 is our bandit algorithm for (ordered) multiple labels. The algorithm is based on replacing the unknown model vectors $u_1, \\ldots, u_K$ with prototype vectors $w'_{1,t}, \\ldots, w'_{K,t}$, $w'_{i,t}$ being the time-$t$ approximation to $u_i$, satisfying constraints similar to those we set for the $u_i$ vectors. For the sake of brevity, we let $\\hat{\\Delta}'_{i,t} = x_t^\\top w'_{i,t}$, and $\\Delta_{i,t} = u_i^\\top x_t$, $i \\in [K]$. The algorithm uses the $\\hat{\\Delta}'_{i,t}$ as proxies for the underlying $\\Delta_{i,t}$ according to the (upper confidence) approximation scheme $\\Delta_{i,t} \\approx [\\hat{\\Delta}'_{i,t} + \\epsilon_{i,t}]_D$, where $\\epsilon_{i,t} \\ge 0$ is a suitable upper-confidence level for class $i$ at time $t$, and $[\\cdot]_D$ denotes the clipping-to-$D$ operation, i.e., $[x]_D = \\max(\\min(x, R), -R)$. The algorithm\u2019s prediction at time $t$ has the same form as the computation of the Bayes optimal sequence $Y^*_t$, where we replace the true (and unknown) $p_{i,t} = \\frac{g(-\\Delta_{i,t})}{g(\\Delta_{i,t}) + g(-\\Delta_{i,t})}$ with the corresponding upper confidence proxy $\\hat{p}_{i,t} = \\frac{g(-[\\hat{\\Delta}'_{i,t} + \\epsilon_{i,t}]_D)}{g([\\hat{\\Delta}'_{i,t} + \\epsilon_{i,t}]_D) + g(-[\\hat{\\Delta}'_{i,t} + \\epsilon_{i,t}]_D)}$. Computing $\\hat{Y}_t$ can be done by mimicking the computation of the Bayes optimal $Y^*_t$ (just replace $p_{i,t}$ by $\\hat{p}_{i,t}$), i.e., in $O(K \\log K)$ running time per prediction. Thus the algorithm produces a ranked list of relevant classes based on the upper-confidence-corrected scores $\\hat{p}_{i,t}$. Class $i$ is deemed relevant and ranked high among the relevant ones when either $\\hat{\\Delta}'_{i,t}$ is a good approximation to $\\Delta_{i,t}$ and $p_{i,t}$ is large, or the algorithm is not very confident about its own approximation for $i$ (that is, the upper confidence level $\\epsilon_{i,t}$ is large).\n\nThe algorithm receives as input the loss parameters $a$ and $c(i, s)$, the model function $g(\\cdot)$ and the associated margin domain $D = [-R, R]$, and maintains both $K$ positive definite matrices $A_{i,t}$ of dimension $d$ (initially set to the $d \\times d$ identity matrix) and $K$ weight vectors $w_{i,t} \\in \\mathbb{R}^d$ (initially set to the zero vector). At each time step $t$, upon receiving the $d$-dimensional instance vector $x_t$, the algorithm uses the weight vectors $w_{i,t}$ to compute the prediction vectors $w'_{i,t}$. These vectors can easily be seen as the result of projecting $w_{i,t}$ onto the space of $w$ where $|w^\\top x_t| \\le R$ w.r.t. the distance function $d_{i,t-1}$, i.e., $w'_{i,t} = \\mathrm{argmin}_{w \\in \\mathbb{R}^d : w^\\top x_t \\in D}\\, d_{i,t-1}(w, w_{i,t})$, $i \\in [K]$, where $d_{i,t}(u, w) = (u - w)^\\top A_{i,t} (u - w)$. Vectors $w'_{i,t}$ are then used to produce the prediction values $\\hat{\\Delta}'_{i,t}$ involved in the upper-confidence calculation of $\\hat{Y}_t \\subseteq [K]$.
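The per-class score computation just described can be sketched as follows (a simplified Python illustration under the square-loss instantiation $g(\\Delta) = 1 - \\Delta$, $D = [-1, 1]$; the confidence radii `eps` are assumed precomputed, and the function names are ours, not the paper\u2019s):

```python
import numpy as np

R = 1.0                       # square-loss instantiation: D = [-R, R]
g = lambda d: 1.0 - d         # g(Delta) = -L'(Delta) for L(Delta) = (1-Delta)^2/2

def clip(x):
    """[x]_D: clipping to the interval D = [-R, R]."""
    return max(min(x, R), -R)

def ucb_scores(x, W, A_inv, eps):
    """Upper-confidence-corrected scores p_hat[i] for each class i.

    W[i] is w_{i,t}, A_inv[i] is A_{i,t-1}^{-1}, and eps[i] is the
    (precomputed) confidence radius eps_{i,t}."""
    p_hat = np.empty(len(W))
    for i, (w, Ai) in enumerate(zip(W, A_inv)):
        m = float(w @ x)
        if abs(m) > R:  # project w so that its margin w^T x falls back into D
            w = w - ((m - R * np.sign(m)) / float(x @ Ai @ x)) * (Ai @ x)
            m = float(w @ x)
        d = clip(m + eps[i])  # optimistic, clipped margin [m + eps]_D
        p_hat[i] = g(-d) / (g(d) + g(-d))
    return p_hat
```

The ordered subset $\\hat{Y}_t$ is then obtained exactly as in Lemma 1, with these scores in place of the true $p_{i,t}$.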
Next, the feedback $Y_t \\cap \\hat{Y}_t$ is observed, and the algorithm in Figure 1 promotes all classes $i \\in Y_t \\cap \\hat{Y}_t$ (sign $s_{i,t} = 1$), demotes all classes $i \\in \\hat{Y}_t \\setminus Y_t$ (sign $s_{i,t} = -1$), and leaves all remaining classes $i \\notin \\hat{Y}_t$ unchanged (sign $s_{i,t} = 0$). The update $w'_{i,t} \\to w_{i,t+1}$ is based on the gradients $\\nabla_{i,t}$ of a loss function $L(\\cdot)$ satisfying $L'(\\Delta) = -g(\\Delta)$. On the other hand, the update $A_{i,t-1} \\to A_{i,t}$ uses the rank-one matrix $x_t x_t^\\top$. (Notice that $A_{i,t}^{-1}$ can be computed incrementally in $O(d^2)$ time per update. [4] and references therein also use diagonal approximations thereof, reporting good empirical performance with just $O(d)$ time per update.) In both the update of $w'_{i,t}$ and the one involving $A_{i,t-1}$, the reader should observe the role played by the signs $s_{i,t}$. Finally, the constants $c'_L$ and $c''_L$ occurring in the expression for $\\epsilon^2_{i,t}$ are related to smoothness properties of $L(\\cdot)$ \u2013 see the next theorem.\n\nTheorem 2. Let $L : D = [-R, R] \\subseteq \\mathbb{R} \\to \\mathbb{R}^+$ be a $C^2(D)$ convex and nonincreasing function of its argument, $(u_1, \\ldots, u_K) \\in \\mathbb{R}^{dK}$ be defined in (2) with $g(\\Delta) = -L'(\\Delta)$ for all $\\Delta \\in D$, and such that $||u_i|| \\le U$ for all $i \\in [K]$. Assume there are positive constants $c_L$, $c'_L$ and $c''_L$ such that: i. $\\frac{L'(\\Delta)\\, L''(-\\Delta) + L''(\\Delta)\\, L'(-\\Delta)}{(L'(\\Delta) + L'(-\\Delta))^2} \\ge -c_L$, ii. $(L'(\\Delta))^2 \\le c'_L$, and iii. $L''(\\Delta) \\ge c''_L$ hold for all $\\Delta \\in D$. Then the cumulative regret $R_T$ of the algorithm in Figure 1 satisfies, with probability at least $1 - \\delta$,\n\n$R_T = O\\left( (1 - a)\\, c_L\\, K \\sqrt{T\\, C\\, d \\ln\\left(1 + \\frac{T}{d}\\right)} \\right),$\n\nwhere $C = O\\left( U^2 + \\frac{d\\, c'_L}{(c''_L)^2} \\ln\\left(1 + \\frac{T}{d}\\right) + \\left( \\frac{c'_L}{(c''_L)^2} + L(-R) \\right) \\ln\\frac{KT}{\\delta} \\right)$.\n\nIt is easy to see that when $L(\\cdot)$ is the square loss $L(\\Delta) = (1 - \\Delta)^2/2$ and $D = [-1, 1]$, we have $c_L = 1/2$, $c'_L = 4$ and $c''_L = 1$; when $L(\\cdot)$ is the logistic loss $L(\\Delta) = \\ln(1 + e^{-\\Delta})$ and $D = [-R, R]$, we have $c_L = 1/4$, $c'_L \\le 1$ and $c''_L = \\frac{1}{2(1 + \\cosh(R))}$, where $\\cosh(x) = \\frac{e^x + e^{-x}}{2}$.\n\nRemark 1. A drawback of Theorem 2 is that, in order to properly set the upper confidence levels $\\epsilon_{i,t}$, we assume prior knowledge of the norm upper bound $U$. Because this information is often unavailable, we present here a simple modification to the algorithm that copes with this limitation. We change the definition of $\\epsilon^2_{i,t}$ in Figure 1 to\n\n$\\epsilon^2_{i,t} = x_t^\\top A_{i,t-1}^{-1} x_t\\, \\max\\left\\{ \\frac{2\\, d\\, c'_L}{(c''_L)^2} \\ln\\left(1 + \\frac{t-1}{d}\\right) + \\frac{12}{c''_L} \\left( \\frac{c'_L}{c''_L} + 3 L(-R) \\right) \\ln\\frac{K(t+4)}{\\delta},\\ 4 R^2 \\right\\}.$\n\nThis immediately leads to the following result.\n\nTheorem 3. With the same assumptions and notation as in Theorem 2, if we replace $\\epsilon^2_{i,t}$ as explained above we have that, with probability at least $1 - \\delta$, $R_T$ satisfies\n\n$R_T = O\\left( (1 - a)\\, c_L\\, K \\sqrt{T\\, C\\, d \\ln\\left(1 + \\frac{T}{d}\\right)} + (1 - a)\\, c_L\\, K\\, R\\, d \\left( \\exp\\left( \\frac{(c''_L)^2 U^2}{c'_L\\, d} \\right) - 1 \\right) \\right).$\n\n4 On ranking with partial feedback\n\nAs Lemma 1 points out, when the cost values $c(i, s)$ in $\\ell_{a,c}$ are strictly decreasing, the Bayes optimal ordered sequence $Y^*_t$ on $x_t$ can be obtained by sorting classes in decreasing values of $p_{i,t}$, and then deciding on a cutoff point induced by the loss parameters (this is called the zero point in [9]), so as to tell relevant classes apart from irrelevant ones. In turn, because $p(\\Delta) = \\frac{g(-\\Delta)}{g(\\Delta) + g(-\\Delta)}$ is increasing in $\\Delta$, this ordering corresponds to sorting classes in decreasing values of $\\Delta_{i,t}$. Now, if the parameter $a$ in $\\ell_{a,c}$ is very close to 1, then $|Y^*_t| = K$, and the algorithm itself will produce ordered subsets $\\hat{Y}_t$ such that $|\\hat{Y}_t| = K$. (If $a = 1$, the algorithm only cares about false negative mistakes, the best strategy being to always predict $\\hat{Y}_t = [K]$. Unsurprisingly, this yields zero regret in both Theorems 2 and 3.) Moreover, it does so by receiving full feedback on the relevant classes at time $t$ (since $Y_t \\cap \\hat{Y}_t = Y_t$). As is customary (e.g., [6]), one can view any multilabel assignment $Y = (y_1, \\ldots, y_K) \\in \\{0, 1\\}^K$ as a ranking among the $K$ classes in the most natural way: $i$ precedes $j$ if and only if $y_i > y_j$. The (unnormalized) ranking loss function $\\ell_{rank}(Y, \\hat{f})$ between the multilabel $Y$ and a ranking function $\\hat{f} : \\mathbb{R}^d \\to \\mathbb{R}^K$, representing degrees of class relevance sorted in decreasing order $\\hat{f}_{j_1}(x_t) \\ge \\hat{f}_{j_2}(x_t) \\ge \\ldots \\ge \\hat{f}_{j_K}(x_t) \\ge 0$, counts the number of class pairs that disagree in the two rankings:\n\n$\\ell_{rank}(Y, \\hat{f}) = \\sum_{i,j \\in [K] : y_i > y_j} \\left( \\{\\hat{f}_i(x_t) < \\hat{f}_j(x_t)\\} + \\tfrac{1}{2}\\, \\{\\hat{f}_i(x_t) = \\hat{f}_j(x_t)\\} \\right),$\n\nwhere $\\{\\ldots\\}$ is the indicator function of the predicate at argument. As pointed out in [6], the ranking function $\\hat{f}(x_t) = (p_{1,t}, \\ldots, p_{K,t})$ is also Bayes optimal w.r.t. $\\ell_{rank}(Y, \\hat{f})$, no matter whether the class labels $y_i$ are conditionally independent or not. Hence we can use this algorithm for tackling ranking problems derived from multilabel ones, when the measure of choice is $\\ell_{rank}$ and the feedback is full.\n\nIn fact, a partial information version of the above can easily be obtained. Suppose that at each time $t$, the environment discloses both $x_t$ and a maximal size $S_t$ for the ordered subset $\\hat{Y}_t = (j_1, j_2, \\ldots, j_{|\\hat{Y}_t|})$ (both $x_t$ and $S_t$ can be chosen adaptively by an adversary). Here $S_t$ might be the number of available slots in a webpage or the number of URLs returned by a search engine in response to query $x_t$. Then it is plausible to compete in a regret sense against the best time-$t$ offline ranking of the form $f(x_t) = (f_1(x_t), f_2(x_t), \\ldots, f_h(x_t), 0, \\ldots, 0)$, with $h \\le S_t$.
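For concreteness, $\\ell_{rank}$ can be computed with a direct double loop (a small Python sketch; `Y` is the 0/1 indicator vector of the multilabel and `f` the vector of predicted relevance degrees $\\hat{f}(x_t)$):

```python
def rank_loss(Y, f):
    """Unnormalized ranking loss: over all pairs (i, j) with y_i > y_j
    (i.e., i relevant and j irrelevant), count 1 for each mis-ordered
    pair f_i < f_j and 1/2 for each tie f_i == f_j."""
    K = len(Y)
    loss = 0.0
    for i in range(K):
        for j in range(K):
            if Y[i] > Y[j]:
                if f[i] < f[j]:
                    loss += 1.0
                elif f[i] == f[j]:
                    loss += 0.5
    return loss
```

With $\\hat{f}(x_t) = (p_{1,t}, \\ldots, p_{K,t})$, this is the quantity the Bayes optimal ranker minimizes in expectation.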
Further, the ranking loss could reasonably be restricted to count the number of class pairs disagreeing within $\hat{Y}_t$, plus a quantity related to the number of false negative mistakes. E.g., if $\widehat{f}_{j_1}(x_t) \ge \widehat{f}_{j_2}(x_t) \ge \ldots \ge \widehat{f}_{j_{|\hat{Y}_t|}}(x_t)$, we can set (the factor $S_t$ below serves to balance the contribution of the two main terms):
$$
\ell_{rank,t}(Y,\widehat{f}) = \sum_{i,j\in \hat{Y}_t\,:\,y_i > y_j} \left( \{\widehat{f}_i(x_t) < \widehat{f}_j(x_t)\} + \tfrac{1}{2}\,\{\widehat{f}_i(x_t) = \widehat{f}_j(x_t)\} \right) + S_t\, |Y_t \setminus \hat{Y}_t|\,.
$$
It is not hard to see that if classes are conditionally independent, $P_t(y_{1,t}, \ldots, y_{K,t}) = \prod_{i\in[K]} p_{i,t}^{y_{i,t}}(1-p_{i,t})^{1-y_{i,t}}$, then the Bayes optimal ranking for $\ell_{rank,t}$ is given by $f^*(x_t; S_t) = (p_{i_1,t}, \ldots, p_{i_{S_t},t}, 0, \ldots, 0)$. If we put on the argmin (Step 3 in Figure 1) the further constraint $|Y| \le S_t$ (we are still sorting classes according to decreasing values of $\widehat{p}_{i,t}$), one can prove the following ranking version of Theorem 2.

Theorem 4. With the same assumptions and notation as in Theorem 2, let the classes in $[K]$ be conditionally independent, i.e., $P_t(y_{1,t}, \ldots, y_{K,t}) = \prod_{i\in[K]} p_{i,t}^{y_{i,t}}(1-p_{i,t})^{1-y_{i,t}}$ for all $t$, and let the cumulative regret $R_T$ w.r.t. $\ell_{rank,t}$ be defined as
$$
R_T = \sum_{t=1}^T \mathbb{E}_t\big[\ell_{rank,t}(Y_t, (\widehat{p}_{j_1,t}, \ldots, \widehat{p}_{j_{S_t},t}, 0, \ldots, 0))\big] - \mathbb{E}_t\big[\ell_{rank,t}(Y_t, (p_{i_1,t}, \ldots, p_{i_{S_t},t}, 0, \ldots, 0))\big],
$$
where $\widehat{p}_{j_1,t} \ge \ldots \ge \widehat{p}_{j_{S_t},t} \ge 0$ and $p_{i_1,t} \ge \ldots \ge p_{i_{S_t},t} \ge 0$. Then, with probability at least $1-\delta$, we have $R_T = O\left(c_L \sqrt{S\, K\, T\, C\, d \ln\left(1 + \tfrac{T}{d}\right)}\right)$, where $S = \max_{t=1,\ldots,T} S_t$.

The proof (see the appendix) is very similar to that of Theorem 2.
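The truncated loss and the Bayes optimal ranking appearing in Theorem 4 can be sketched directly; in the sketch below (names and the representation of $\hat{Y}_t$ as a list of class indices sorted by decreasing score are our own assumptions), the second term charges $S_t$ for each relevant class left out of the displayed slots:

```python
def rank_loss_t(y, f_hat, Y_hat, S_t):
    """Truncated ranking loss l_{rank,t}: pair disagreements are counted
    only within the predicted ordered subset Y_hat, plus S_t times the
    number of false negatives (relevant classes left out of Y_hat)."""
    shown = set(Y_hat)
    loss = 0.0
    for i in Y_hat:
        for j in Y_hat:
            if y[i] > y[j]:
                if f_hat[i] < f_hat[j]:
                    loss += 1.0
                elif f_hat[i] == f_hat[j]:
                    loss += 0.5
    missed = sum(1 for i, yi in enumerate(y) if yi == 1 and i not in shown)
    return loss + S_t * missed

def bayes_optimal_subset(p, S_t):
    """Under conditional independence, the Bayes optimal ranking sorts
    classes by decreasing p_{i,t} and keeps the top S_t of them."""
    return sorted(range(len(p)), key=lambda i: -p[i])[:S_t]

print(bayes_optimal_subset([0.1, 0.8, 0.3, 0.6], 2))  # [1, 3]
```

For instance, with $y = (1,1,0,0)$ and scores $(0.9, 0.4, 0.6, 0.1)$, showing classes $(0, 2, 1)$ yields one disagreeing pair, while showing only $(0, 2)$ with $S_t = 2$ yields no pair disagreement but a false-negative penalty of $2$.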
This suggests that, to some extent, we are decoupling the label generating model from the loss function $\ell$ under consideration. Notice that the linear dependence on the total number of classes $K$ (which is often much larger than $S$ in a multilabel/ranking problem) is replaced by $\sqrt{SK}$. One could get similar benefits out of Theorem 2. Finally, one could also combine Theorem 4 with the argument contained in Remark 1.

5 Experiments and conclusions
The experiments we report here are meant to validate the exploration-exploitation tradeoff implemented by our algorithm under different conditions (restricted vs. nonrestricted number of classes), loss measures ($\ell_{a,c}$, $\ell_{rank,t}$, and Hamming loss), and model/parameter settings ($L$ = square loss, $L$ = logistic loss, with varying $R$).

Datasets. We used two multilabel datasets. The first one, called Mediamill, was introduced in a video annotation challenge [21]. It comprises 30,993 training samples and 12,914 test ones. The number of features $d$ is 120, and the number of classes $K$ is 101. The second dataset is Sony CSL Paris [18], made up of 16,452 training samples and 16,519 test samples, each sample being described by $d = 98$ features. The number of classes $K$ is 632. In both cases, feature vectors have been normalized to unit $L_2$ norm.

Parameter setting and loss measures. We used the algorithm in Figure 1 with two different loss functions, the square loss and the logistic loss, and varied the parameter $R$ for the latter. The setting of the cost function $c(i,s)$ depends on the task at hand, and for these preliminary experiments we decided to evaluate only two possible settings. The first one, denoted by "decreasing c", is $c(i,s) = \frac{s-i+1}{s}$, $i = 1, \ldots, s$; the second one, denoted by "constant c", is $c(i,s) = 1$ for all $i$ and $s$. In all experiments, the parameter $a$ was set to 0.5, so that $\ell_{a,c}$ with constant $c$ reduces to half the Hamming loss. In the decreasing $c$ scenario, we evaluated the performance of the algorithm on the loss $\ell_{a,c}$ that the algorithm is minimizing, but also its ability to produce meaningful (partial) rankings through $\ell_{rank,t}$. In the constant $c$ setting, we evaluated the Hamming loss. As is typical of multilabel problems, the label density, i.e., the average fraction of labels associated with the examples, is quite small. For instance, on Mediamill it is 4.3%. Hence, it is clearly beneficial to impose an upper bound $S$ on $|\hat{Y}_t|$. For the constant $c$ and ranking loss experiments, we tried out different values of $S$ and reported the final performance.

Figure 2: Experiments on the Sony CSL Paris dataset.
Figure 3: Experiments on the Mediamill dataset.

Baseline. As a baseline, we considered a full information version of Algorithm 1 using the square loss, which receives after each prediction the full array of true labels $Y_t$ for each sample. We call this algorithm OBR (Online Binary Relevance), because it is a natural online adaptation of the binary relevance algorithm, widely used as a baseline in the multilabel literature. Comparing to OBR stresses the effectiveness of the exploration/exploitation rule above and beyond the details of the underlying generalized linear predictor. OBR was used to produce subsets (as in the Hamming loss case) and restricted rankings (as in the case of $\ell_{rank,t}$).

Results. Our results are summarized in Figures 2 and 3. The algorithms have been trained by sweeping only once over the training data. Though preliminary in nature, these experiments allow us to draw a few conclusions. Our results for the average $\ell_{a,c}(Y_t, \hat{Y}_t)$ with decreasing $c$ are contained in the two left plots. We can see that the performance is improving over time on both datasets, as predicted by Theorem 2.
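The two cost settings used in the experiments admit a direct transcription (function names are ours); the decreasing variant weights the top-ranked slots most heavily, while the constant variant, combined with $a = 0.5$, reduces $\ell_{a,c}$ to half the Hamming loss:

```python
def c_decreasing(i, s):
    """'decreasing c' setting: c(i, s) = (s - i + 1) / s for i = 1, ..., s.
    Higher-ranked slots (small i) receive larger cost weights."""
    return (s - i + 1) / s

def c_constant(i, s):
    """'constant c' setting: c(i, s) = 1 for all i and s."""
    return 1.0

s = 4
print([c_decreasing(i, s) for i in range(1, s + 1)])  # [1.0, 0.75, 0.5, 0.25]
```

With $s = 4$ slots, the decreasing costs fall linearly from 1 to $1/s$, so a mistake in the top banner slot is $s$ times more costly than one in the last slot.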
The middle plots show the final cumulative Hamming losses with constant $c$, divided by the number of training samples, as a function of $S$. Similar plots are on the right, with the final average ranking losses $\ell_{rank,t}$ divided by $S$. In both cases we see that there is an optimal value of $S$ that balances the exploration and the exploitation of the algorithm. Moreover, the performance of our algorithm is always quite close to that of OBR, even though our algorithm receives only partial feedback. In many experiments the square loss seems to give better results; an exception is the ranking loss on the Mediamill dataset (Figure 3, right).

Conclusions. We have used generalized linear models to formalize the exploration-exploitation tradeoff in a multilabel/ranking setting with partial feedback, providing $T^{1/2}$-like regret bounds under semi-adversarial settings. Our analysis decouples the multilabel/ranking loss at hand from the label-generation model. Thanks to the use of calibrated score values $\widehat{p}_{i,t}$, our algorithm is capable of automatically inferring where to split the ranking between relevant and nonrelevant classes [9], the split being clearly induced by the loss parameters in $\ell_{a,c}$. We are planning on using more general label models that explicitly capture label correlations, to be applied to other loss functions (e.g., F-measure, 0/1, average precision, etc.). We are also planning on carrying out a more thorough experimental comparison, especially to full information multilabel methods that take such correlations into account.
Finally, we are currently working on extending our framework to structured output tasks, like (multilabel) hierarchical classification.

References
[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In 25th NIPS, 2011.
[2] K. Amin, M. Kearns, and U. Syed. Graphical models for bandit problems. In UAI, 2011.
[3] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3, 2003.
[4] K. Crammer and C. Gentile. Multiclass classification with bandit feedback using adaptive regularization. In 28th ICML, 2011.
[5] V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In 21st COLT, 2008.
[6] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, to appear.
[7] S. Filippi, O. Cappé, A. Garivier, and C.
Szepesv\u00b4ari. Parametric bandits: The generalized\n\nlinear case. In NIPS, pages 586\u2013594, 2010.\n\n[8] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An ef\ufb01cient boosting algorithm for\n\ncombining preferences. JMLR, 4:933\u2013969, 2003.\n\n[9] J. Furnkranz, E. Hullermeier, E. Loza Menca, and K. Brinker. Multilabel classi\ufb01cation via\n\ncalibrated label ranking. Machine Learning, 73:133\u2013153, 2008.\n\n[10] E. Hazan and S. Kale. Online submodular minimization. In NIPS 22, 2009.\n[11] E. Hazan and S. Kale. Newtron: an ef\ufb01cient bandit algorithm for online multiclass prediction.\n\nIn NIPS, 2011.\n\n[12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regres-\n\nsion. In Advances in Large Margin Classi\ufb01ers. MIT Press, 2000.\n\n[13] S. Kakade, S. Shalev-Shwartz, and A. Tewari. Ef\ufb01cient bandit algorithms for online multiclass\n\nprediction. In 25th ICML, 2008.\n\n[14] S. Kale, L. Reyzin, and R. Schapire. Non-stochastic bandit slate problems. In 24th NIPS, 2010.\n[15] A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. In 25th NIPS,\n\n2011.\n\n[16] T. H. Lai and H. Robbins. Asymptotically ef\ufb01cient adaptive allocation rules. Adv. Appl. Math,\n\n6, 1985.\n\n[17] P. McCullagh and J.A. Nelder. Generalized linear models. Chapman and Hall, 1989.\n[18] F. Pachet and P. Roy. Improving multilabel analysis of music titles: A large-scale validation\nof the correction approach. IEEE Trans. on Audio, Speech, and Lang. Proc., 17(2):335\u2013343,\nFebruary 2009.\n\n[19] P. Shivaswamy and T. Joachims. Online structured prediction via coactive learning. In 29th\n\nICML, 2012, to appear.\n\n[20] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large\n\ndocument collections. In 27th ICML, 2010.\n\n[21] C. G. M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A. W. M. 
Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proc. of the 14th ACM International Conference on Multimedia, MULTIMEDIA '06, pages 421-430, New York, NY, USA, 2006.
[22] M. Streeter, D. Golovin, and A. Krause. Online learning of assignments. In 23rd NIPS, 2009.
[23] G. Tsoumakas, I. Katakis, and I. Vlahavas. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23:1079-1089, 2011.