{"title": "Online combinatorial optimization with stochastic decision sets and adversarial losses", "book": "Advances in Neural Information Processing Systems", "page_first": 2780, "page_last": 2788, "abstract": "Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stock. In this paper we study learning algorithms that are able to deal with stochastic availability of such unreliable composite actions. We propose and analyze algorithms based on the Follow-The-Perturbed-Leader prediction method for several learning settings differing in the feedback provided to the learner. Our algorithms rely on a novel loss estimation technique that we call Counting Asleep Times. We deliver regret bounds for our algorithms for the previously studied full information and (semi-)bandit settings, as well as a natural middle point between the two that we call the restricted information setting. A special consequence of our results is a significant improvement of the best known performance guarantees achieved by an efficient algorithm for the sleeping bandit problem with stochastic availability. Finally, we evaluate our algorithms empirically and show their improvement over the known approaches.", "full_text": "Online combinatorial optimization with stochastic\n\ndecision sets and adversarial losses\n\nGergely Neu\n\nMichal Valko\n\nSequeL team, INRIA Lille \u2013 Nord Europe, France\n{gergely.neu,michal.valko}@inria.fr\n\nAbstract\n\nMost work on sequential learning assumes a \ufb01xed set of actions that are available\nall the time. However, in practice, actions can consist of picking subsets of read-\nings from sensors that may break from time to time, road segments that can be\nblocked or goods that are out of stock. 
In this paper we study learning algorithms\nthat are able to deal with stochastic availability of such unreliable composite ac-\ntions. We propose and analyze algorithms based on the Follow-The-Perturbed-\nLeader prediction method for several learning settings differing in the feedback\nprovided to the learner. Our algorithms rely on a novel loss estimation technique\nthat we call Counting Asleep Times. We deliver regret bounds for our algorithms\nfor the previously studied full information and (semi-)bandit settings, as well as a\nnatural middle point between the two that we call the restricted information set-\nting. A special consequence of our results is a signi\ufb01cant improvement of the best\nknown performance guarantees achieved by an ef\ufb01cient algorithm for the sleeping\nbandit problem with stochastic availability. Finally, we evaluate our algorithms\nempirically and show their improvement over the known approaches.\n\n1\n\nIntroduction\n\nIn online learning problems [4] we aim to sequentially select actions from a given set in order to\noptimize some performance measure. However, in many sequential learning problems we have to\ndeal with situations when some of the actions are not available to be taken. A simple and well-\nstudied problem where such situations arise is that of sequential routing [8], where we have to select\nevery day an itinerary for commuting from home to work so as to minimize the total time spent\ndriving (or even worse, stuck in a traf\ufb01c jam). In this scenario, some road segments may be blocked\nfor maintenance, forcing us to work with the rest of the road network. This problem is isomorphic to\npacket routing in ad-hoc computer networks where some links might not be always available because\nof a faulty transmitter or a depleted battery. Another important class of sequential decision-making\nproblems where the decision space might change over time is recommender systems [11]. 
Here, some items may be out of stock or some service may not be applicable at some time (e.g., a movie not shown that day, bandwidth issues in video streaming services). In these cases, the advertiser may refrain from recommending unavailable items. Other reasons include a distributor being overloaded with orders or facing shipment problems.
Learning problems with such partial-availability restrictions have been previously studied in the framework of prediction with expert advice. Freund et al. [7] considered the problem of online prediction with specialist experts, where some experts' predictions might not be available from time to time, and the goal of the learner is to minimize regret against the best mixture of experts. Kleinberg et al. [15] proposed a stronger notion of regret measured against the best ranking of experts and gave efficient algorithms that work under stochastic assumptions on the losses, referring to this setting as prediction with sleeping experts. They also introduced the notion of sleeping bandit problems, where the learner only gets partial feedback about its decisions. They gave an inefficient algorithm for the non-stochastic case, with some hints that it might be difficult to learn efficiently in this general setting. This was later reaffirmed by Kanade and Steinke [14], who reduced the problem of PAC learning of DNF formulas to a non-stochastic sleeping experts problem, proving the hardness of learning in this setup. Despite these negative results, Kanade et al. [13] have shown that there is still hope of obtaining efficient algorithms in adversarial environments if one introduces a certain stochastic assumption on the decision set.
In this paper, we extend the work of Kanade et al. [13] to combinatorial settings where the action set of the learner is possibly huge, but has a compact representation.
We also assume stochastic\naction availability: in each decision period, the decision space is drawn from a \ufb01xed but unknown\nprobability distribution independently of the history of interaction between the learner and the envi-\nronment. The goal of the learner is to minimize the sum of losses associated with its decisions. As\nusual in online settings, we measure the performance of the learning algorithm by its regret de\ufb01ned\nas the gap between the total loss of the best \ufb01xed decision-making policy from a pool of policies\nand the total loss of the learner. The choice of this pool, however, is a rather delicate question in our\nproblem: the usual choice of measuring regret against the best \ufb01xed action is meaningless, since not\nall actions are available in all time steps. Following Kanade et al. [13] (see also [15]), we consider\nthe policy space composed of all mappings from decision sets to actions within the respective sets.\nWe study the above online combinatorial optimization setting under three feedback assumptions.\nBesides the full-information and bandit settings considered by Kanade et al. [13], we also consider a\nrestricted feedback scheme as a natural middle ground between the two by assuming that the learner\ngets to know the losses associated only with available actions. This extension (also studied by [15])\nis crucially important in practice, since in most cases it is unrealistic to expect that an unavailable\nexpert would report its loss. Finally, we also consider a generalization of bandit feedback to the\ncombinatorial case known as semi-bandit feedback.\nOur main contributions in this paper are two algorithms called SLEEPINGCAT and SLEEPINGCAT-\nBANDIT that work in the restricted and semi-bandit information schemes, respectively. The best\nknown competitor of our algorithms is the BSFPL algorithm of Kanade et al. [13] that works in\ntwo phases. 
First, an initial phase is dedicated to the estimation of the distribution of the available actions. Then, in the main phase, BSFPL randomly alternates between exploration and exploitation. Our technique improves over the FPL-based method of Kanade et al. [13] by removing the costly exploration phase dedicated to estimating the availability probabilities, and also the explicit exploration steps in their main phase. This is achieved by a cheap alternative loss-estimation procedure called Counting Asleep Times (or CAT) that does not require estimating the distribution of the action sets. This technique improves the regret bound of [13] after T steps from O(T^{4/5}) to O(T^{2/3}) in their setting, and also provides a regret guarantee of O(√T) in the restricted setting.¹

2 Background

We now give the formal definition of the learning problem. We consider a sequential interaction scheme between a learner and an environment where in each round t ∈ [T] = {1, 2, . . . , T}, the learner has to choose an action V_t from a subset S_t of a known decision set S ⊆ {0,1}^d with ‖v‖₁ ≤ m for all v ∈ S. We assume that the environment selects S_t according to some fixed (but unknown) distribution P, independently of the interaction history. Unaware of the learner's decision, the environment also decides on a loss vector ℓ_t ∈ [0,1]^d that will determine the loss suffered by the learner, which is of the form V_t^⊤ ℓ_t. We make no assumptions on how the environment generates the sequence of loss vectors; that is, we are interested in algorithms that work in non-oblivious (or adaptive) environments. At the end of each round, the learner receives some feedback based on the loss vector and its own action. The goal of the learner is to pick its actions so as to minimize the losses it accumulates by the end of the T-th round.
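This interaction can be made concrete with a short simulation of one round under the three feedback schemes discussed below (a minimal sketch; the function and variable names are ours, not part of the paper):

```python
import numpy as np

def play_round(S_t, loss_t, V_t, feedback="full"):
    """One round of the interaction protocol (illustrative sketch).

    S_t: the available decision vectors (0/1 numpy arrays), drawn from P,
    loss_t: adversarial loss vector in [0, 1]^d,
    V_t: the learner's chosen action, an element of S_t.
    Returns the suffered loss and a dict {component: observed loss}.
    """
    d = len(loss_t)
    suffered = float(V_t @ loss_t)
    if feedback == "full":            # (a) the whole loss vector is revealed
        observed = {i: loss_t[i] for i in range(d)}
    elif feedback == "restricted":    # (b) only the available components are revealed
        D_t = {i for v in S_t for i in range(d) if v[i] == 1}
        observed = {i: loss_t[i] for i in D_t}
    else:                             # (c) semi-bandit: own components only
        observed = {i: loss_t[i] for i in range(d) if V_t[i] == 1}
    return suffered, observed
```

Note how the observed set shrinks from (a) to (c) while the suffered loss is the same in all three cases.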
This setup generalizes the setting of online combinatorial optimization considered by Cesa-Bianchi and Lugosi [5] and Audibert et al. [1], where the decision set is assumed to be fixed throughout the learning procedure. The interaction protocol is summarized in Figure 1 for reference.

¹While not explicitly proved by Kanade et al. [13], their technique can be extended to work in the restricted setting, where it can be shown to guarantee a regret of O(T^{3/4}).

Parameters: full set of decision vectors S ⊆ {0,1}^d, number of rounds T, unknown distribution P ∈ Δ_{2^S}.
For all t = 1, 2, . . . , T repeat
  1. The environment draws a set of available actions S_t ∼ P and picks a loss vector ℓ_t ∈ [0,1]^d.
  2. The set S_t is revealed to the learner.
  3. Based on its previous observations (and possibly some source of randomness), the learner picks an action V_t ∈ S_t.
  4. The learner suffers loss V_t^⊤ ℓ_t and gets some feedback:
     (a) in the full information setting, the learner observes ℓ_t,
     (b) in the restricted setting, the learner observes ℓ_{t,i} for all i ∈ D_t,
     (c) in the semi-bandit setting, the learner observes ℓ_{t,i} for all i such that V_{t,i} = 1.

Figure 1: The protocol of online combinatorial optimization with stochastic action availability.

We distinguish between three different feedback schemes, the simplest being the full information scheme, where the loss vectors are completely revealed to the learner at the end of each round. In the restricted-information scheme, we make the much milder assumption that the learner is informed of the losses of the available actions.
Precisely, we define the set of available components as

D_t = {i ∈ [d] : ∃v ∈ S_t : v_i = 1}

and assume that the learner can observe the i-th component of the loss vector ℓ_t if and only if i ∈ D_t. This is a sensible assumption in a number of practical applications, e.g., in sequential routing problems where components are associated with links in a network. Finally, in the semi-bandit scheme, we assume that the learner only observes losses associated with the components of its own decision, that is, the feedback is ℓ_{t,i} for all i such that V_{t,i} = 1. This is the case in online advertising settings where components of the decision vectors represent customer-ad allocations. The observation history F_t is defined as the sigma-algebra generated by the actions chosen by the learner and the decision sets handed out by the environment by the end of round t: F_t = σ(V_t, S_t, . . . , V_1, S_1).
The performance of the learner is measured with respect to the best fixed policy (otherwise known as a choice function in discrete choice theory [16]) of the form π : 2^S → S. In words, a policy π will pick action π(S̄) ∈ S̄ whenever the environment selects action set S̄. The (total expected) regret of the learner is defined as

R_T = max_π E[ Σ_{t=1}^T (V_t − π(S_t))^⊤ ℓ_t ].    (1)

Note that the above expectation integrates over both the randomness injected by the learner and the stochastic process generating the decision sets. The attentive reader might notice that this regret criterion is very similar to that of Kanade et al. [13], who study the setting of prediction with expert advice (where m = 1) and measure regret against the best fixed ranking of experts.
It is actually easy to show that the optimal policy in their setting belongs to the set of ranking policies, making our regret definition equivalent to theirs.

3 Loss estimation by Counting Asleep Times

In this section, we describe our method for estimating unobserved losses, which works without having to explicitly learn the availability distribution P. To explain the concept at a high level, let us consider our simpler partial-observability setting, the restricted-information setting. For the formal treatment of the problem, let us fix any component i ∈ [d] and define A_{t,i} = 1{i ∈ D_t} and a_i = E[A_{t,i} | F_{t−1}]. Had we known the observation probability a_i, we would be able to estimate the i-th component of the loss vector ℓ_t by ℓ̂*_{t,i} = (ℓ_{t,i} A_{t,i})/a_i, as the quantity ℓ_{t,i} A_{t,i} is observable. It is easy to see that the estimate ℓ̂*_{t,i} is unbiased by definition. Unfortunately, we do not know a_i, so we have no hope of computing this estimate. A simple idea used by Kanade et al. [13] is to devote the first T_0 rounds of interaction solely to the purpose of estimating a_i by the sample mean â_i = (Σ_{t=1}^{T_0} A_{t,i})/T_0. While this trick gets the job done, it is obviously wasteful, as we have to throw away all loss observations before the estimates are sufficiently concentrated.²
We take a much simpler approach based on the observation that the "asleep-time" of component i is a geometrically distributed random variable with parameter a_i.
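Operationally, this observation leads to a very light update rule, sketched below (an illustration of the last-observation rule formalized in the rest of this section; the names are ours):

```python
import numpy as np

def cat_update(l_hat_prev, loss_t, D_t):
    """One round of the Counting Asleep Times estimator, in its recursive
    form: awake components (i in D_t) record their observed loss, while
    sleeping components keep the last observation (illustrative sketch)."""
    l_hat = l_hat_prev.copy()
    for i in D_t:
        l_hat[i] = loss_t[i]      # observed this round
    return l_hat                  # all other components retain their old value
```

On every component that is awake at time t, this vector agrees with the formal estimate defined below, which is all the algorithm ever reads.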
The asleep-time of component i starting from time t is formally defined as

N_{t,i} = min{n > 0 : i ∈ D_{t+n}},

which is the number of rounds until the next observation of the loss associated with component i. Using the above definition, we construct our loss estimates as the vector ℓ̂_t whose i-th component is

ℓ̂_{t,i} = ℓ_{t,i} A_{t,i} N_{t,i}.    (2)

It is easy to see that the above loss estimates are unbiased, as

E[ℓ_{t,i} A_{t,i} N_{t,i} | F_{t−1}] = ℓ_{t,i} E[A_{t,i} | F_{t−1}] E[N_{t,i} | F_{t−1}] = ℓ_{t,i} · a_i · (1/a_i) = ℓ_{t,i}

for any i. We will refer to this loss-estimation method as Counting Asleep Times (CAT).
Looking at the definition (2), the attentive reader might worry that the vector ℓ̂_t depends on future realizations of the random decision sets and thus could be useless in practice. However, observe that there is no reason for the learner to use the estimate ℓ̂_{t,i} before component i wakes up in round t + N_{t,i}, which is precisely the time when the estimate becomes well-defined. This suggests a very simple implementation of CAT: whenever a component is not available, estimate its loss by the last observation from that component! More formally, set

ℓ̂_{t,i} = ℓ_{t,i} if i ∈ D_t, and ℓ̂_{t,i} = ℓ̂_{t−1,i} otherwise.

It is easy to see that at the beginning of any round t, the two alternative definitions match for all components i ∈ D_t. In the next section, we confirm that this property is sufficient for running our algorithm.

4 Algorithms & their analyses

For all information settings, we base our learning algorithms on the Follow-the-Perturbed-Leader (FPL) prediction method of Hannan [9], as popularized by Kalai and Vempala [12].
This algorithm works by additively perturbing the total estimated loss of each component and then running an optimization oracle over the perturbed losses to choose the next action. More precisely, our algorithms maintain the cumulative sum of their loss estimates L̂_t = Σ_{s=1}^t ℓ̂_s and pick the action

V_t = arg min_{v ∈ S_t} v^⊤( η L̂_{t−1} − Z_t ),

where Z_t is a perturbation vector with independent exponentially distributed components with unit expectation, generated independently of the history, and η > 0 is a parameter of the algorithm. Our algorithms for the different information settings will be instances of FPL that employ different loss estimates suitable for the respective settings. In the first part of this section, we present the main tools of analysis that will be used for each resulting method.
As usual for analyzing FPL-based methods [12, 10, 18], we start by defining a hypothetical forecaster that uses a time-independent perturbation vector Z̃ with standard exponential components and peeks one step into the future. However, we need an extra trick to deal with the randomness of the decision set: we introduce the time-independent decision set S̃ ∼ P (drawn independently of the filtration (F_t)_t) and define

Ṽ_t = arg min_{v ∈ S̃} v^⊤( η L̂_t − Z̃ ).

²Notice that we require "sufficient concentration" from 1/â_i and not only from â_i! The deviation of such quantities is rather difficult to control, as demonstrated by the complicated analysis of Kanade et al. [13].

Clearly, this forecaster is infeasible, as it uses observations from the future. Also observe that Ṽ_{t−1} ∼ V_t given F_{t−1}.
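For concreteness, a single FPL draw can be sketched as follows (our own illustration; the linear optimization oracle is brute force here, but any minimizer over the available actions would do):

```python
import numpy as np

def fpl_choose(L_hat, decision_set, eta, rng):
    """Pick argmin over the available actions of v . (eta * L_hat - Z),
    where Z has i.i.d. unit-mean exponential components drawn fresh each
    round (illustrative sketch with a brute-force oracle)."""
    Z = rng.exponential(scale=1.0, size=len(L_hat))
    scores = [float(np.asarray(v) @ (eta * L_hat - Z)) for v in decision_set]
    return decision_set[int(np.argmin(scores))]
```

With a large learning rate η the perturbation is negligible and the call behaves like plain follow-the-leader on the estimated losses; with η close to zero it plays almost uniformly over the corners of the decision set.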
The following two lemmas show how analyzing this forecaster can help in establishing the performance of our actual algorithms.

Lemma 1. For any sequence of loss estimates, the expected regret of the hypothetical forecaster against any fixed policy π : 2^S → S satisfies

E[ Σ_{t=1}^T (Ṽ_t − π(S̃))^⊤ ℓ̂_t ] ≤ m(log d + 1)/η.

The statement is easily proved by applying the follow-the-leader/be-the-leader lemma³ (see, e.g., [4, Lemma 3.1]) and using the upper bound E[‖Z̃‖_∞] ≤ log d + 1.
The following result can be extracted from the proof of Theorem 1 of Neu and Bartók [18].

Lemma 2. For any sequence of nonnegative loss estimates,

E[ (Ṽ_{t−1} − Ṽ_t)^⊤ ℓ̂_t | F_{t−1} ] ≤ η E[ ( Ṽ_{t−1}^⊤ ℓ̂_t )² | F_{t−1} ].

In the next subsections, we apply these results to obtain bounds for the three information settings.

4.1 Algorithm for full information

In the simplest setting, we can use ℓ̂_t = ℓ_t, which yields the following theorem:

Theorem 1. Define

L*_T = max{ E[ min_π Σ_{t=1}^T π(S_t)^⊤ ℓ_t ], 4(log d + 1) }.

Setting η = √((log d + 1)/L*_T), the regret of FPL in the full information scheme satisfies

R_T ≤ 2m √( 2 L*_T (log d + 1) ).

As this result is comparable to the best available bounds for FPL [10, 18] in the full information setting with a fixed decision set, it reinforces the observation of Kanade et al.
[13], who show that the sleeping experts problem with full information and stochastic availability is no more difficult than the standard experts problem. The proof of Theorem 1 follows directly from combining Lemmas 1 and 2 with some standard tricks. For completeness, details are provided in Appendix A.

4.2 Algorithm for restricted feedback

In this section, we use the CAT loss estimate defined in Equation (2) as ℓ̂_t in FPL, and call the resulting method SLEEPINGCAT. The following theorem gives the performance guarantee for this algorithm.

Theorem 2. Define Q_t = Σ_{i=1}^d E[V_{t,i} | i ∈ D_t]. The total expected regret of SLEEPINGCAT against the best fixed policy is upper bounded as

R_T ≤ m(log d + 1)/η + 2ηm Σ_{t=1}^T Q_t.

Proof. We start by observing E[π(S̃)^⊤ ℓ̂_t] = E[π(S_t)^⊤ ℓ_t], where we used that ℓ̂_t is independent of S̃ and is an unbiased estimate of ℓ_t, and also that S_t ∼ S̃. The proof is completed by combining this with Lemmas 1 and 2, and the bound

E[ ( Ṽ_{t−1}^⊤ ℓ̂_t )² | F_{t−1} ] ≤ 2mQ_t.

The proof of this last statement follows from a tedious calculation that we defer to Appendix B.

³This lemma can be proved in the current case by virtue of the fixed decision set S̃, allowing the necessary recursion steps to go through.

Below, we provide two ways of further bounding the regret under various assumptions. The first one provides a universal upper bound that holds without any further assumptions.

Corollary 1.
Setting η = √((log d + 1)/(2dT)), the regret of SLEEPINGCAT against the best fixed policy is bounded as

R_T ≤ 2m √( 2dT(log d + 1) ).

The proof follows from the fact that Q_t ≤ d no matter what P is. A somewhat surprising feature of our bound is its scaling with √(d log d), which is much worse than the logarithmic dependence exhibited in the full information case. It is easy to see, however, that this bound is not improvable in general; see Appendix D for a simple example. The next bound shows that it is possible to improve this result by assuming that most components are reliable in some sense, which is the case in many practical settings.

Corollary 2. Assuming a_i ≥ β for all i, we have Q_t ≤ 1/β, and setting η = √(β(log d + 1)/(2T)) guarantees that the regret of SLEEPINGCAT against the best fixed policy is bounded as

R_T ≤ 2m √( 2T(log d + 1)/β ).

4.3 Algorithm for semi-bandit feedback

We now turn our attention to the problem of learning with semi-bandit feedback, where the learner only gets to observe the losses associated with its own decision. Specifically, we assume that the learner observes all components i of the loss vector such that V_{t,i} = 1. The extra difficulty in this setting is that our actions influence the feedback that we receive, so we have to be more careful when defining our loss estimates. Ideally, we would like to work with unbiased estimates of the form

ℓ̂*_{t,i} = (ℓ_{t,i}/q*_{t,i}) V_{t,i},  where  q*_{t,i} = E[V_{t,i} | F_{t−1}] = Σ_{S̄ ∈ 2^S} P(S̄) E[V_{t,i} | F_{t−1}, S_t = S̄]    (3)

for all i ∈ [d]. Unfortunately though, we are in no position to compute these estimates, as this would require perfect knowledge of the availability distribution P!
Thus we have to look for another way to compute reliable loss estimates. A possible idea is to use

q_{t,i} · a_i = E[V_{t,i} | F_{t−1}, S_t] · P[i ∈ D_t]

instead of q*_{t,i} in Equation (3) to normalize the observed losses. This choice yields another unbiased loss estimate, as

E[ ℓ_{t,i}V_{t,i}/(q_{t,i}a_i) | F_{t−1} ] = E[ (ℓ_{t,i}/a_i) E[ V_{t,i}/q_{t,i} | F_{t−1}, S_t ] | F_{t−1} ] = (ℓ_{t,i}/a_i) E[A_{t,i} | F_{t−1}] = ℓ_{t,i},    (4)

which leaves us with the problem of computing q_{t,i} and a_i. While this also seems to be a tough challenge, we now show how to estimate this quantity by generalizing the CAT technique presented in Section 3.
Besides our trick of estimating the 1/a_i's by the random variables N_{t,i}, we now also have to face the problem of not being able to find a closed-form expression for the q_{t,i}'s. Hence, we follow the geometric resampling approach of Neu and Bartók [18] and draw an additional sequence of M perturbation vectors Z′_t(1), . . . , Z′_t(M), and use them to compute

V′_t(k) = arg min_{v ∈ S_t} v^⊤( η L̂_{t−1} − Z′_t(k) )

for all k ∈ [M]. Using these simulated actions, we define

K_{t,i} = min( {k ∈ [M] : V′_{t,i}(k) = V_{t,i}} ∪ {M} )

and

ℓ̂_{t,i} = ℓ_{t,i} K_{t,i} N_{t,i} V_{t,i}    (5)

for all i. Setting M = ∞ makes this expression equivalent to ℓ_{t,i}V_{t,i}/(q_{t,i}a_i) in expectation, yielding yet another unbiased estimator for the losses.
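A sketch of how the estimate (5) can be computed in practice (our own illustration with hypothetical helper names; the oracle is again brute force): draw up to M fresh perturbations, record for each component the index of the first simulated action that agrees with V_t, and cap at M.

```python
import numpy as np

def semibandit_cat_estimate(V_t, loss_t, N_t, decision_set, L_hat, eta, M, rng):
    """Loss estimate (5): loss * K * N * V on the observed components,
    with K obtained by geometric resampling over at most M fresh FPL
    copies (illustrative sketch; N_t holds the asleep-times)."""
    V_t = np.asarray(V_t)
    d = len(loss_t)
    K = np.full(d, M)                 # default: capped at M
    found = np.zeros(d, dtype=bool)
    for k in range(1, M + 1):
        Z = rng.exponential(size=d)   # fresh perturbation, playing the role of Z'_t(k)
        scores = [float(np.asarray(v) @ (eta * L_hat - Z)) for v in decision_set]
        V_prime = np.asarray(decision_set[int(np.argmin(scores))])
        newly = ~found & (V_prime == V_t)   # first agreement on each component
        K[newly] = k
        found |= newly
        if found.all():               # every component resolved: stop resampling early
            break
    return loss_t * K * N_t * V_t     # nonzero only where V_{t,i} = 1
```

Note that the early stop only saves computation; the returned value is the same as if all M copies had been drawn.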
Our analysis, however, crucially relies on setting M to a finite value so as to control the variance of the loss estimates. We are not aware of any other work that achieves a similar variance-reduction effect without explicitly exploring the action space [17, 6, 5, 3], making this alternative bias-variance tradeoff a unique feature of our analysis. We call the algorithm resulting from using the loss estimates above SLEEPINGCATBANDIT. The following theorem gives the performance guarantee for this algorithm.

Theorem 3. Define Q_t = Σ_{i=1}^d E[V_{t,i} | i ∈ D_t]. The total expected regret of SLEEPINGCATBANDIT against the best fixed policy is bounded as

R_T ≤ m(log d + 1)/η + 2ηMm Σ_{t=1}^T Q_t + dT/(eM).

Proof. First, observe that E[ℓ̂_{t,i} | F_{t−1}] ≤ ℓ_{t,i}, as E[K_{t,i}V_{t,i} | F_{t−1}, S_t] ≤ A_{t,i} and E[A_{t,i}N_{t,i} | F_{t−1}] = 1 by definition. Thus, we can get E[π(S̃)^⊤ ℓ̂_t] ≤ E[π(S_t)^⊤ ℓ_t] by a similar argument to the one used in the proof of Theorem 2. After yet another long and tedious calculation (see Appendix C), we can prove

E[ ( Ṽ_{t−1}^⊤ ℓ̂_t )² | F_{t−1} ] ≤ 2MmQ_t.

The proof is concluded by combining this bound with Lemmas 1 and 2 and the upper bound

E[ V_t^⊤ ℓ_t | F_{t−1} ] ≤ E[ Ṽ_{t−1}^⊤ ℓ̂_t | F_{t−1} ] + d/(eM),    (6)

which can be proved by following the proof of Theorem 1 in Neu and Bartók [18].

Corollary 3.
Setting η and M so as to balance the three terms of the bound of Theorem 3 (with Q_t ≤ d) guarantees that the regret of SLEEPINGCATBANDIT against the best fixed policy is bounded as

R_T ≤ (2mdT)^{2/3} · (log d + 1)^{1/3}.

The proof of the corollary follows from bounding Q_t ≤ d and plugging the parameters into the bound of Theorem 3. Similarly to the improvement of Corollary 2, it is possible to replace the factor d^{2/3} by (d/β)^{1/3} if we assume that a_i ≥ β for all i and some β > 0.
This corollary implies that SLEEPINGCATBANDIT achieves a regret of (2KT)^{2/3} · (log K + 1)^{1/3} in the case when S = [K], that is, in the K-armed sleeping bandit problem considered by Kanade et al. [13]. This improves their bound of O((KT)^{4/5} log T) by a large margin, thanks to the fact that we do not have to explicitly learn the distribution P.

5 Experiments

In this section, we present an empirical evaluation of our algorithms in the bandit and semi-bandit settings and compare them to their counterparts [13]. We demonstrate that the wasteful exploration of BSFPL not only results in worse regret bounds but also degrades its empirical performance.
For the bandit case, we evaluate SLEEPINGCATBANDIT using the same setting as Kanade et al. [13]. We consider an experiment with T = 10,000 and 5 arms, each of which is available independently of the others with probability p. Losses for each arm are constructed as random walks with Gaussian increments of standard deviation 0.002, initialized uniformly on [0, 1]. Losses outside [0, 1] are truncated. In our first experiment (Figure 2, left), we study the effect of changing p on the performance of BSFPL and SLEEPINGCATBANDIT. Notice that when p is very low, there are few or no arms to choose from. In this case the problems are easy by design and all algorithms suffer low regret. As p increases, the policy space starts to blow up and the problem becomes more difficult.
When p approaches one, the policy space collapses into the set of single arms and the problem gets easier again. Observe that the behavior of SLEEPINGCATBANDIT follows this trend. On the other hand, the performance of BSFPL steadily decreases with increasing availability. This is due to the explicit exploration rounds in the main phase of BSFPL, which suffer the loss of the uniform policy scaled by the exploration probability. The performance of the uniform policy is plotted for reference.

Figure 2: Left: Multi-arm bandits with varying availabilities. Middle: Shortest paths on a 3 × 3 grid. Right: Shortest paths on a 10 × 10 grid.

To evaluate SLEEPINGCATBANDIT in the semi-bandit setting, we consider the shortest path problem on grids of 3 × 3 and 10 × 10 nodes, which amounts to 12 and 180 edges, respectively. For each edge, we generate a random-walk loss sequence in the same way as in our first experiment. In each round t, the learner has to choose a path from the lower left corner to the upper right one, composed of available edges. Each edge is individually available with probability 0.9, independently of the others. Whenever an edge gets disconnected from the source, it becomes unavailable itself, resulting in a quite complicated action-availability distribution. Once a learner chooses a path, the losses of the chosen road segments are revealed and the learner suffers their sum. Since [13] does not provide a combinatorial version of their approach, we compare against COMBBSFPL, a straightforward extension of BSFPL. As in BSFPL, we dedicate an initial phase to estimating the availabilities of each component, requiring d oracle calls per step. In the main phase, we follow BSFPL and alternate between exploration and exploitation.
In exploration rounds, we test for the reachability of a randomly sampled edge and update the reward estimates as in BSFPL. Figure 2 (middle and right) shows the performance of COMBBSFPL and SLEEPINGCATBANDIT for a fixed loss sequence, averaged over 20 samples of the component availabilities. We also plot the performance of a random policy that follows the perturbed leader with all-zero loss estimates. First observe that the initial exploration phase sets back the performance of COMBBSFPL significantly. The second drawback of COMBBSFPL is the explicit separation of exploration and exploitation rounds. This drawback is far more apparent when the number of components increases, as is the case for the 10 × 10 grid graph with 180 components. As COMBBSFPL only estimates the loss of one edge per exploration step, sampling each edge as few as 50 times eats up 9,000 rounds of the available 10,000. SLEEPINGCATBANDIT does not suffer from this problem, as it uses all of its observations in constructing the loss estimates.

6 Conclusions & future work

In this paper, we studied the problem of online combinatorial optimization with changing decision sets. Our main contribution is a novel loss-estimation technique that enabled us to prove strong regret bounds under various partial-feedback schemes. In particular, our results largely improve on the best known results for the sleeping bandit problem [13], where the previous approach suffers large losses both from an initial exploration phase and from explicit exploration rounds in the main phase. These findings are also supported by our experiments.
Still, one might ask if it is possible to efficiently achieve a regret of order √T under semi-bandit feedback. While the EXP4 algorithm of Auer et al. [2] can be used to obtain such a regret guarantee, running this algorithm is out of the question, as its time and space complexity can be double-exponential in d (see also the comments in [15]).
Had we had access to the loss estimates (3), we would be able to control the regret of FPL, as the term on the right-hand side of Equation (6) could be replaced by m√d, which is sufficient for obtaining a regret bound of O(m√(dT log d)). In fact, it seems that learning in the bandit setting requires significantly more knowledge about P than the knowledge of the a_i's. The question of whether we can extend the CAT technique to estimate all the relevant quantities of P is an interesting problem left for future investigation.

Acknowledgements The research presented in this paper was supported by the French Ministry of Higher Education and Research, by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270327 (CompLACS), and by the FUI project Hermès.

[Figure 2, left panel: cumulative regret at time T = 10000 vs. availability; sleeping bandits, 5 arms, varying availability, averaged over 20 runs; curves: BSFPL, SleepingCat, RandomGuess]

References

[1] Audibert, J. Y., Bubeck, S., and Lugosi, G. (2014). Regret in online combinatorial optimization. Mathematics of Operations Research. To appear.

[2] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77.

[3] Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. (2012). Towards minimax policies for online linear optimization with bandit feedback. In COLT 2012, pages 1–14.

[4] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA.

[5] Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422.

[6] Dani, V., Hayes, T. P., and Kakade, S. (2008). The price of bandit information for online optimization. In NIPS-20, pages 345–352.

[7] Freund, Y., Schapire, R., Singer, Y., and Warmuth, M. (1997).
Using and combining predictors that specialize. In Proceedings of the 29th Annual ACM Symposium on the Theory of Computing, pages 334–343. ACM Press.

[8] György, A., Linder, T., Lugosi, G., and Ottucsák, Gy. (2007). The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403.

[9] Hannan, J. (1957). Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139.

[10] Hutter, M. and Poland, J. (2004). Prediction with expert advice by following the perturbed leader for general weights. In ALT, pages 279–293.

[11] Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010). Recommender Systems: An Introduction. Cambridge University Press.

[12] Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307.

[13] Kanade, V., McMahan, H. B., and Bryan, B. (2009). Sleeping experts and bandits with stochastic action availability and adversarial rewards. In AISTATS 2009, pages 272–279.

[14] Kanade, V. and Steinke, T. (2012). Learning hurdles for sleeping experts. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS 12), pages 11–18. ACM.

[15] Kleinberg, R. D., Niculescu-Mizil, A., and Sharma, Y. (2008). Regret bounds for sleeping experts and bandits. In COLT 2008, pages 425–436.

[16] Koshevoy, G. A. (1999). Choice functions and abstract convex geometries. Mathematical Social Sciences, 38(1):35–44.

[17] McMahan, H. B. and Blum, A. (2004). Online geometric optimization in the bandit setting against an adaptive adversary. In COLT 2004, pages 109–123.

[18] Neu, G. and Bartók, G. (2013).
An efficient algorithm for learning with semi-bandit feedback. In ALT 2013, pages 234–248.